
Latest publications from ACM Transactions on Database Systems (TODS)

On Finding Rank Regret Representatives
Pub Date : 2022-08-18 DOI: 10.1145/3531054
Abolfazl Asudeh, Gautam Das, H. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, N. Zhang, Jianwen Zhao
Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function. However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in two-dimensional space but is NP-hard on three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.
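To make the rank-regret notion concrete, here is a minimal Python sketch (all names are illustrative and ours). It checks the rank-regret condition only against a finite sample of linear ranking functions, whereas the paper's algorithms reason about all possible ranking functions; the subset is assumed to be drawn from the dataset:

```python
def rank_of_best_item(data, subset, weights):
    """1-based rank (within all of data) of the best subset item under
    the linear scoring function score(x) = sum(w_i * x_i)."""
    score = lambda item: sum(w * x for w, x in zip(weights, item))
    all_scores = sorted((score(item) for item in data), reverse=True)
    return all_scores.index(max(score(item) for item in subset)) + 1

def is_rank_regret_representative(data, subset, k, weight_samples):
    """Heuristic check: does subset contain a top-k item for every
    *sampled* linear ranking function?  (The article's algorithms
    handle all possible ranking functions, not a sample.)"""
    return all(rank_of_best_item(data, subset, w) <= k
               for w in weight_samples)
```

With two extreme items in the subset, any of the three sampled rankings is already covered at rank 1, while a single dominated item is not.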
Citations: 1
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration
Pub Date : 2022-06-25 DOI: 10.1145/3531055
Nofar Carmeli, Shai Zeevi, Christoph Berkholz, A. Conte, B. Kimelfeld, Nicole Schweikardt
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this article, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically meaningful manner. An even more demanding task is that of “random access”: polylogarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. We identify a fragment of such UCQs where we can guarantee random access with polylogarithmic access time (and linear-time preprocessing) and a more general fragment where we can guarantee tractable random permutation. For general unions of free-connex acyclic CQs, we devise two algorithms with relaxed guarantees: one has logarithmic delay in expectation, and the other provides a permutation that is almost uniformly distributed. 
Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.
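One direction of the connection between the yardsticks is easy to illustrate: given random access to the answers by position, a uniformly random enumeration order follows by shuffling the index domain. A sketch with illustrative names (the paper's constructions for CQs are far more involved, and the delay here is the access cost plus O(1) per answer after an O(n) shuffle):

```python
import random

def random_order_enumerate(random_access, n, rng=None):
    """Given random_access(i) for i in [0, n), yield all n answers in
    uniformly random order via a Fisher-Yates shuffle of the indices."""
    rng = rng or random.Random()
    idx = list(range(n))
    for i in range(n - 1, 0, -1):      # Fisher-Yates: uniform permutation
        j = rng.randrange(i + 1)
        idx[i], idx[j] = idx[j], idx[i]
    for i in idx:
        yield random_access(i)
```

Every answer is produced exactly once, in an order that is uniform over all n! permutations.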
Citations: 2
Persistent Summaries
Pub Date : 2022-05-23 DOI: 10.1145/3531053
Tian Zeng, Zhewei Wei, Ge Luo, K. Yi, Xiaoyong Du, Ji-Rong Wen
A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried. Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms, which use only sublinear space by sacrificing slightly on accuracy, have received a lot of attention in the research community. All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream, thus it is ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
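As a toy illustration of the persistence idea (names are ours; this stores one value per version rather than a sublinear-space sketch, which is exactly what the article avoids), a running-sum summary can keep every version queryable by timestamp:

```python
import bisect

class PersistentSum:
    """Toy persistent summary: every update creates a new version of a
    running sum, and any past version remains queryable."""
    def __init__(self):
        self.times = []   # update timestamps, strictly increasing
        self.sums = []    # running sum after each update

    def update(self, t, value):
        prev = self.sums[-1] if self.sums else 0
        self.times.append(t)
        self.sums.append(prev + value)

    def query(self, t):
        """Sum of all elements that arrived at time <= t, i.e., the
        answer the summary would have given at time t."""
        i = bisect.bisect_right(self.times, t)
        return self.sums[i - 1] if i else 0
```

Binary search over the version timestamps retrieves any historical answer in logarithmic time.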
Citations: 0
Influence Maximization Revisited: Efficient Sampling with Bound Tightened
Pub Date : 2022-05-19 DOI: 10.1145/3533817
Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang
Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m)log n/ε2) expected time while returning a (1 - 1/e - ε) approximate solution with at least 1 - 1/n probability. A key phase of these IM algorithms is the random reverse reachable (RR) set generation, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms. In this article, we present a study on this key phase and propose an efficient random RR set generation algorithm under the IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under the IC model can be improved to O(k · n log n/ε2) when, for any node v, the total weight of its incoming edges is no larger than a constant. For the general IC model where the weights are skewed, we present a sampling algorithm SKIP. To the best of our knowledge, it is the first index-free algorithm that achieves the optimal time complexity of the sorted subset sampling problem. Moreover, existing approximate IM algorithms suffer from scalability issues in high-influence networks where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than the state of the art, as shown in our experiments. In addition, we investigate the issue of forward propagation and derive its time complexity with our proposed subset sampling techniques. We also present a heuristic condition to indicate when the forward propagation approach should be utilized to estimate the expected influence of a given seed set.
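A random RR set under the Independent Cascade (IC) model can be sketched as a reverse breadth-first search that keeps each incoming edge independently with its propagation probability. This is the textbook baseline generator, not the paper's optimized algorithm, and the names are illustrative:

```python
import random

def random_rr_set(n, in_edges, rng=None):
    """One random reverse-reachable (RR) set under the IC model: pick a
    uniform target node, then explore backwards, retaining each incoming
    edge (u, v) independently with probability p.
    in_edges[v] is a list of (u, p) pairs for edges u -> v."""
    rng = rng or random.Random()
    target = rng.randrange(n)
    rr, frontier = {target}, [target]
    while frontier:
        v = frontier.pop()
        for u, p in in_edges.get(v, []):
            if u not in rr and rng.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr
```

With all edge probabilities 1.0 on a directed cycle, every RR set is the whole node set; with no edges, each RR set is a singleton.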
Citations: 8
Unified Route Planning for Shared Mobility: An Insertion-based Framework
Pub Date : 2022-03-31 DOI: 10.1145/3488723
Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Lei Chen, Ke Xu
There has been a dramatic growth of shared mobility applications such as ride-sharing, food delivery, and crowdsourced parcel delivery. Shared mobility refers to transportation services that are shared among users, where a central issue is route planning. Given a set of workers and requests, route planning finds for each worker a route, i.e., a sequence of locations to pick up and drop off passengers/parcels that arrive from time to time, with different optimization objectives. Previous studies lack practicability due to their conflicting objectives and inefficiency in inserting a new request into a route, a basic operation called insertion. In addition, previous route planning solutions fail to exploit the appearance patterns of future requests hidden in historical data for optimization. In this paper, we present a unified formulation of route planning called URPSM. It has a well-defined parameterized objective function, which eliminates the contradictory objectives of previous studies and enables flexible multi-objective route planning for shared mobility. We propose two insertion-based frameworks to solve the URPSM problem. The first is built upon the plain-insertion widely used in prior studies, which processes online requests only, whereas the second relies on a new insertion operator called prophet-insertion that handles both online and predicted requests. Novel dynamic programming algorithms are designed to accelerate both insertions to only linear time. Theoretical analysis shows that no online algorithm can have a constant competitive ratio for the URPSM problem under the competitive analysis model, yet our prophet-insertion-based framework can achieve a constant optimality ratio under the instance-optimality model.
Extensive experimental results on real datasets show that our insertion-based solutions outperform the state-of-the-art algorithms in both effectiveness and efficiency by a large margin (e.g., up to 30× more effective in the objective and up to 20× faster).
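The insertion operator the frameworks build on can be sketched naively: enumerate all O(n²) position pairs for the pickup and dropoff and keep the candidate route with the smallest cost increase (illustrative names and a distance-only cost; the paper's dynamic programming brings this operator down to linear time and handles richer constraints):

```python
def route_cost(route, dist):
    """Total travel cost of visiting the route's locations in order."""
    return sum(dist(a, b) for a, b in zip(route, route[1:]))

def cheapest_insertion(route, pickup, dropoff, dist):
    """Naive plain insertion: try every pair of positions i <= j for
    inserting pickup and dropoff (pickup must precede dropoff) and
    return (min cost increase, best new route)."""
    base = route_cost(route, dist)
    best = (float('inf'), None)
    for i in range(len(route) + 1):
        for j in range(i, len(route) + 1):
            cand = route[:i] + [pickup] + route[i:j] + [dropoff] + route[j:]
            inc = route_cost(cand, dist) - base
            if inc < best[0]:
                best = (inc, cand)
    return best
```

On a line with Euclidean distance, inserting a request that lies on the worker's current path adds zero detour.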
Citations: 12
The Space-Efficient Core of Vadalog
Pub Date : 2022-03-31 DOI: 10.1145/3488720
Gerald Berger, G. Gottlob, Andreas Pieris, Emanuel Sallinger
Vadalog is a system for performing complex reasoning tasks such as those required in advanced knowledge graphs. The logical core of the underlying Vadalog language is the warded fragment of tuple-generating dependencies (TGDs). This formalism ensures tractable reasoning in data complexity, while a recent analysis focusing on a practical implementation led to the reasoning algorithm around which the Vadalog system is built. A fundamental question that has emerged in the context of Vadalog is whether we can limit the recursion allowed by wardedness in order to obtain a formalism that provides a convenient syntax for expressing useful recursive statements, and at the same time achieves space-efficiency. After analyzing several real-life examples of warded sets of TGDs provided by our industrial partners, as well as recent benchmarks, we observed that recursion is often used in a restricted way: the body of a TGD contains at most one atom whose predicate is mutually recursive with a predicate in the head. We show that this type of recursion, known as piece-wise linear in the Datalog literature, is the answer to our main question. We further show that piece-wise linear recursion alone, without the wardedness condition, is not enough as it leads to undecidability. We also study the relative expressiveness of the query languages based on (piece-wise linear) warded sets of TGDs. Finally, we give preliminary experimental evidence for the practical effect of piece-wise linearity on Vadalog.
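Transitive closure is the textbook example of piece-wise linear recursion: the recursive rule's body contains exactly one atom that is mutually recursive with the head, which is precisely the restriction described above. A semi-naive evaluation sketch in Python, with the Datalog rules shown in comments (an illustration of the recursion shape, not of the Vadalog system):

```python
def transitive_closure(edges):
    """Semi-naive evaluation of the piece-wise linear program
        T(x, y) :- E(x, y).
        T(x, y) :- T(x, z), E(z, y).
    Only the single recursive atom T is joined with new facts (delta)
    in each round."""
    total = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, y2) for (x, z) in delta
                       for (z2, y2) in edges if z == z2}
        delta = new - total
        total |= delta
    return total
```

Each iteration joins only the newly derived T-facts with the non-recursive relation E, which is what keeps the recursion "linear" in each piece.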
Citations: 1
Height Optimized Tries
Pub Date : 2022-03-31 DOI: 10.1145/3506692
Robert Binna, Eva Zangerle, M. Pichl, Günther Specht, Viktor Leis
We present the Height Optimized Trie (HOT), a fast and space-efficient in-memory index structure. The core algorithmic idea of HOT is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. For a fixed maximum node fanout, the overall tree height is minimal and its structure is deterministically defined. Multiple carefully engineered node implementations using SIMD instructions or lightweight compression schemes provide compactness and fast search and optimize HOT structures for different usage scenarios. Our experiments, which use a wide variety of workloads and data sets, show that HOT outperforms other state-of-the-art index structures for string keys both in terms of search performance and memory footprint, while being competitive for integer keys.
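The idea of dynamically varying the number of bits considered at each node can be illustrated with a toy sketch (hypothetical helper names; this is not HOT's actual node layout, which uses carefully engineered SIMD-friendly partial keys): a node keys only on the bit positions at which its keys actually differ, so a few data-dependent bits can yield a high fanout.

```python
def discriminative_bits(keys, width=8):
    """Bit positions (MSB first) at which the given integer keys
    differ; a height-optimized node considers only such positions, so
    the number of bits per node varies with the data."""
    return [b for b in range(width - 1, -1, -1)
            if len({(k >> b) & 1 for k in keys}) > 1]

def partition_key(key, bits):
    """Compact partial key: the key's bits at the discriminative
    positions only, giving a node fanout of up to 2**len(bits)."""
    return sum(((key >> b) & 1) << i for i, b in enumerate(reversed(bits)))
```

For the keys {0b0000, 0b0001, 0b0100}, only bits 2 and 0 discriminate, and the two-bit partial keys already separate all three keys in one node.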
Citations: 1
Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation Methods
Pub Date : 2022-03-31 DOI: 10.1145/3524915
Ji Cheng, Da Yan, Wenwen Qu, Xiaotian Hao, Cheng Long, Wilfred Ng, Xiaoling Wang
Given a data matrix D, a submatrix S of D is an order-preserving submatrix (OPSM) if there is a permutation of the columns of S under which the entry values of each row in S are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preferences. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work on OPSM has ever considered uncertain value intervals using the well-established possible-world semantics. We establish two different definitions of significant OPSMs based on the possible-world semantics: (1) expected-support-based and (2) probabilistic-frequentness-based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, with a closed-form formula derived to efficiently handle the special case of uniform value distribution and an accurate cubic spline approximation approach that works well with any uncertain value distribution. To efficiently check probabilistic frequentness, several effective pruning rules are designed to prune insignificant OPSMs; two approximation techniques, based on the Poisson and Gaussian distributions respectively, are proposed for further speedup. These techniques are integrated into our two OPSM mining algorithms, based on prefix-projection and Apriori, respectively. We further parallelize our prefix-projection-based mining algorithm using PrefixFPM, a recently proposed framework for parallel frequent pattern mining, and achieve a good speedup with the number of CPU cores. Extensive experiments on real microarray data demonstrate that the OPSMs found by our algorithms have a much higher quality than those found by existing approaches.
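In the exact (noise-free) setting, the OPSM condition can be tested directly: a column permutation that makes a row strictly increasing must be that row's sort order, so all rows must share one sort order with no ties. A toy sketch of that check (our naming; it does not address the uncertain-interval setting that is the paper's subject):

```python
def is_opsm(S):
    """Return True if some single column permutation makes every
    row of the matrix S (a list of rows) strictly increasing."""
    perms = []
    for row in S:
        order = sorted(range(len(row)), key=lambda j: row[j])
        # A tie within a row rules out *strictly* increasing
        # under any permutation.
        if any(row[order[i]] >= row[order[i + 1]]
               for i in range(len(order) - 1)):
            return False
        perms.append(order)
    # The sort order must be the same permutation for all rows.
    return all(p == perms[0] for p in perms)
```

For example, [[1, 3, 2], [4, 9, 5]] is an OPSM (both rows become increasing under column order 0, 2, 1), while [[1, 2, 3], [3, 2, 1]] is not.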
Citations: 1
Incremental Graph Computations: Doable and Undoable
Pub Date : 2022-03-10 DOI: 10.1145/3500930
W. Fan, Chao Tian
The incremental problem for a class 𝒬 of graph queries aims to compute, given a query Q ∈ 𝒬, a graph G, the answers Q(G) to Q in G, and updates ΔG to G as input, the changes ΔO to the output Q(G) such that Q(G⊕ΔG) = Q(G)⊕ΔO. It is called bounded if its cost can be expressed as a polynomial function in the sizes of Q, ΔG, and ΔO, which reduces computations on a possibly big G to the small ΔG and ΔO. No matter how desirable, however, our first results are negative: for common graph queries such as traversal, connectivity, keyword search, pattern matching, and maximum cardinality matching, the incremental problems are unbounded. In light of the negative results, we propose two characterizations for the effectiveness of incremental graph computation: (a) localizable, if its cost is decided by small neighborhoods of the nodes in ΔG instead of the entire G; and (b) bounded relative to a batch graph algorithm 𝒯, if the cost is determined by the size of ΔG and the changes to the affected area that is necessarily checked by any algorithm that incrementalizes 𝒯. We show that the incremental computations above are either localizable or relatively bounded, by providing corresponding incremental algorithms. That is, we can either reduce the incremental computations on big graphs to small data, or incrementalize existing batch graph algorithms by minimizing unnecessary recomputation. Using real-life and synthetic data, we experimentally verify the effectiveness of our incremental algorithms.
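Single-source reachability under edge insertions is a textbook illustration of the "localizable" property: after an update, work is confined to the small newly reachable region around ΔG rather than the entire G. A minimal sketch (our naming, not the paper's algorithms):

```python
from collections import deque

def insert_edge(adj, reach, u, v):
    """Maintain the set `reach` of nodes reachable from a fixed
    source after inserting edge (u, v) into adjacency map `adj`.
    Work is confined to the newly reachable region: if v was
    already reachable (or u is not), nothing is traversed."""
    adj.setdefault(u, set()).add(v)
    if u in reach and v not in reach:
        frontier = deque([v])
        while frontier:
            x = frontier.popleft()
            if x in reach:
                continue
            reach.add(x)                 # x just became reachable
            frontier.extend(adj.get(x, ()))
    return reach
```

Inserting (1, 2) into a graph where {0, 1} are reachable and 2→3 exists touches only nodes 2 and 3, regardless of how large the rest of G is.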
Citations: 3
Optimal Joins Using Compressed Quadtrees
Pub Date : 2022-02-23 DOI: 10.1145/3514231
Diego Arroyuelo, G. Navarro, Juan L. Reutter, J. Rojas-Ledesma
Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now count several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality, one either needs to build completely new indexes or must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that is typically one or two orders of magnitude more than what is required to store the raw data. We show that worst-case optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need of any significant extra storage. Our representation is a compressed quadtree for the static indexes and a quadtree built on the fly that shares subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, which simulates navigation of the quadtree of the output, and show that the running time of this algorithm is worst-case optimal in data complexity. We implement our index and compare it experimentally with state-of-the-art alternatives. Our experiments show that our index uses even less space than what is needed to store the data in raw form (and replaces it), and one or two orders of magnitude less space than the other indexes. At the same time, our query algorithm is competitive in time, even sharply outperforming other indexes in various cases. Finally, we extend our framework to evaluate more expressive queries from relational algebra, including not only joins and intersections but also unions and negations. To obtain optimality on those more complex formulas, we introduce a lazy version of qdags, which we dub lqdags, that allows us to navigate over the quadtree representing the output of a formula while only evaluating what is needed from its components. We show that the running time of our query algorithms on this extended set of operations is worst-case optimal under some constraints. Moving to full relational algebra, we also show that lqdags can handle selections and projections. While worst-case optimality is no longer guaranteed, we introduce a partial materialization scheme that extends results from Deep and Koutris regarding compressed representation of query results.
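The pruning that underlies the quadtree representation can be illustrated on the simplest set operation, intersecting two binary relations stored as point sets in a 2^k × 2^k grid: a whole quadrant is discarded as soon as either operand is empty there, without enumerating points. A toy, uncompressed sketch (our naming; the paper's compressed quadtrees and qdags are substantially more engineered):

```python
def quadtree(points, x0, y0, size):
    # Pointer-based quadtree over integer points inside the
    # size x size grid anchored at (x0, y0); size is a power of two.
    if not points:
        return None            # empty quadrant
    if size == 1:
        return True            # occupied cell
    h = size // 2
    return [quadtree([(x, y) for (x, y) in points
                      if x0 + dx <= x < x0 + dx + h
                      and y0 + dy <= y < y0 + dy + h],
                     x0 + dx, y0 + dy, h)
            for dx in (0, h) for dy in (0, h)]

def intersect(a, b):
    # The key pruning step: as soon as one operand is empty in a
    # quadrant, the whole quadrant is dropped without recursing.
    if a is None or b is None:
        return None
    if a is True and b is True:
        return True
    kids = [intersect(x, y) for x, y in zip(a, b)]
    return kids if any(k is not None for k in kids) else None

def points_of(t, x0, y0, size):
    # Enumerate the occupied cells of a quadtree, in quadrant order.
    if t is None:
        return []
    if t is True:
        return [(x0, y0)]
    h = size // 2
    offs = [(0, 0), (0, h), (h, 0), (h, h)]
    return [p for (dx, dy), k in zip(offs, t)
            for p in points_of(k, x0 + dx, y0 + dy, h)]
```

Intersecting the trees for {(0,0), (3,3)} and {(0,0), (1,1)} visits only the quadrants where both operands are nonempty and yields {(0,0)}; the empty quadrants are pruned at the top level.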
Citations: 6
Journal: ACM Transactions on Database Systems (TODS)