Abolfazl Asudeh, Gautam Das, H. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, N. Zhang, Jianwen Zhao
Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of the data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score incurred by limiting consideration to the representative instead of the full dataset, for any chosen ranking function. However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k items of any possible ranking function. This problem is solvable in polynomial time in two-dimensional space but is NP-hard in three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.
{"title":"On Finding Rank Regret Representatives","authors":"Abolfazl Asudeh, Gautam Das, H. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, N. Zhang, Jianwen Zhao","doi":"10.1145/3531054","DOIUrl":"https://doi.org/10.1145/3531054","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"32 1","pages":"1 - 37"},"publicationDate":"2022-08-18","publicationTypes":"Journal Article"}
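To make the rank-based notion of regret in the abstract above concrete, the following minimal sketch estimates the rank-regret of a candidate subset in two dimensions. It assumes linear ranking functions with non-negative weights and only samples a finite set of weight directions, so it is an approximation for illustration and not one of the paper's algorithms; the function name `rank_regret` and the parameter `num_directions` are hypothetical.

```python
import math

def rank_regret(points, subset, num_directions=1000):
    """Largest rank, over sampled linear ranking functions, of the subset's best item.

    Assumption: 2D points, scores w[0]*x + w[1]*y with non-negative weights,
    and higher scores rank first. Sampling directions only approximates the
    supremum over all ranking functions.
    """
    worst = 0
    for i in range(num_directions):
        theta = (math.pi / 2) * i / (num_directions - 1)
        w = (math.cos(theta), math.sin(theta))
        best_in_subset = max(w[0] * p[0] + w[1] * p[1] for p in subset)
        # Rank of the subset's best item = 1 + number of items scoring strictly higher.
        rank = 1 + sum(1 for p in points if w[0] * p[0] + w[1] * p[1] > best_in_subset)
        worst = max(worst, rank)
    return worst

points = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.5, 0.5)]
print(rank_regret(points, [(0.6, 0.6)]))
```

A subset whose value is at most k under this measure contains a top-k item for every sampled ranking function, which is the property the rank-regret representative asks for over all ranking functions.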
Nofar Carmeli, Shai Zeevi, Christoph Berkholz, A. Conte, B. Kimelfeld, Nicole Schweikardt
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this article, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically meaningful manner. An even more demanding task is that of “random access”: polylogarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. We identify a fragment of such UCQs where we can guarantee random access with polylogarithmic access time (and linear-time preprocessing) and a more general fragment where we can guarantee tractable random permutation. For general unions of free-connex acyclic CQs, we devise two algorithms with relaxed guarantees: one has logarithmic delay in expectation, and the other provides a permutation that is almost uniformly distributed. 
Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.
{"title":"Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration","authors":"Nofar Carmeli, Shai Zeevi, Christoph Berkholz, A. Conte, B. Kimelfeld, Nicole Schweikardt","doi":"10.1145/3531055","DOIUrl":"https://doi.org/10.1145/3531055","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"33 1","pages":"1 - 49"},"publicationDate":"2022-06-25","publicationTypes":"Journal Article"}
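The link between random access and random-order enumeration described in the abstract above can be illustrated with a lazy Fisher-Yates shuffle: given a routine that returns the answer at a given position, answers can be emitted in uniformly random order without materializing the result set. This is a generic sketch; `random_order` and the `access` callback are hypothetical names, and the paper's data structures for conjunctive queries are not reproduced here.

```python
import random

def random_order(n, access, rng=None):
    """Yield access(0..n-1) in uniformly random order using a sparse Fisher-Yates shuffle.

    `access(i)` is assumed to return the i-th answer (random access).
    Only positions that were swapped are stored, so memory grows with
    the number of answers emitted so far, not with n up front.
    """
    if rng is None:
        rng = random.Random()
    swapped = {}  # position -> index currently stored there, if not the position itself
    for i in range(n):
        j = rng.randrange(i, n)
        value_i = swapped.get(i, i)
        value_j = swapped.get(j, j)
        swapped[j] = value_i      # move the element at position i into position j
        swapped.pop(i, None)      # position i is final and never read again
        yield access(value_j)

answers = ["a", "b", "c", "d", "e"]
print(list(random_order(len(answers), answers.__getitem__, random.Random(7))))
```

Each step costs one call to `access` plus constant expected dictionary work, so the delay between answers is dominated by the random-access time.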
Tian Zeng, Zhewei Wei, Ge Luo, K. Yi, Xiaoyong Du, Ji-Rong Wen
A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried. Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear in the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms, which use only sublinear space at a slight cost in accuracy, have received a lot of attention in the research community. All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream and is thus ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
{"title":"Persistent Summaries","authors":"Tian Zeng, Zhewei Wei, Ge Luo, K. Yi, Xiaoyong Du, Ji-Rong Wen","doi":"10.1145/3531053","DOIUrl":"https://doi.org/10.1145/3531053","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"103 1","pages":"1 - 42"},"publicationDate":"2022-05-23","publicationTypes":"Journal Article"}
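As a toy illustration of what a persistent summary offers, the sketch below keeps an approximate arrival counter that can be queried at any past time. It stores a checkpoint only when the count has grown by more than a factor of (1 + eps) since the last checkpoint, so far fewer checkpoints than arrivals are kept. This is an assumed, simplified construction for a single counter, not the summaries designed in the article; the class name `PersistentApproxCounter` is hypothetical.

```python
import bisect

class PersistentApproxCounter:
    """Approximate count of stream arrivals, queryable at any prior time.

    Invariant: the true count never exceeds (1 + eps) times the most
    recent checkpoint, so every historical answer is within that factor.
    """

    def __init__(self, eps):
        self.eps = eps
        self.count = 0
        self.times = []   # timestamps of stored checkpoints (non-decreasing)
        self.values = []  # exact count at each checkpoint

    def add(self, t):
        """Record one arrival at time t (timestamps must be non-decreasing)."""
        self.count += 1
        if not self.values or self.count > (1 + self.eps) * self.values[-1]:
            self.times.append(t)
            self.values.append(self.count)

    def query(self, t):
        """Return an underestimate of the number of arrivals with timestamp <= t."""
        i = bisect.bisect_right(self.times, t)
        return self.values[i - 1] if i else 0

counter = PersistentApproxCounter(eps=0.1)
for t in range(1, 10001):
    counter.add(t)
print(counter.query(5000), len(counter.times))
```

The design trades exactness for space: a fully persistent exact counter would need one entry per update, whereas this sketch keeps a number of checkpoints that grows only logarithmically with the stream length for a fixed eps.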
Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m) log n / ε²) expected time while returning a (1 - 1/e - ε)-approximate solution with probability at least 1 - 1/n. A key phase of these IM algorithms is the generation of random reverse reachable (RR) sets, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms. In this article, we present a study on this key phase and propose an efficient random RR set generation algorithm under the IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under the IC model can be improved to O(k · n log n / ε²) when, for any node v, the total weight of its incoming edges is no larger than a constant. For the general IC model where the weights are skewed, we present a sampling algorithm SKIP. To the best of our knowledge, it is the first index-free algorithm that achieves the optimal time complexity of the sorted subset sampling problem. Moreover, existing approximate IM algorithms suffer from scalability issues in high-influence networks, where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than the state of the art, as shown in our experiments. In addition, we investigate the issues of forward propagation and derive its time complexity with our proposed subset sampling techniques. We also present a heuristic condition to indicate when the forward propagation approach should be utilized to estimate the expected influence of a given seed set.
{"title":"Influence Maximization Revisited: Efficient Sampling with Bound Tightened","authors":"Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang","doi":"10.1145/3533817","DOIUrl":"https://doi.org/10.1145/3533817","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 45"},"publicationDate":"2022-05-19","publicationTypes":"Journal Article"}
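For readers unfamiliar with the RR set generation phase mentioned in the abstract above, here is a minimal sketch of the textbook procedure under the IC model: pick a uniformly random root and run a reverse breadth-first search in which each incoming edge is kept independently with its probability. It flips one coin per incoming edge, which is the baseline cost that subset sampling is meant to reduce; it is not the SKIP algorithm. The names `random_rr_set` and `in_edges` are assumptions made for this example.

```python
import random
from collections import deque

def random_rr_set(in_edges, nodes, rng):
    """Generate one random reverse reachable (RR) set under the IC model.

    in_edges[v] is a list of (u, p) pairs: edge u -> v is live with probability p.
    `nodes` must be a sequence so that a root can be drawn uniformly at random.
    """
    root = rng.choice(nodes)
    rr = {root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u, p in in_edges.get(v, []):
            # Naive step: one coin flip per incoming edge of every visited node.
            if u not in rr and rng.random() < p:
                rr.add(u)
                queue.append(u)
    return rr

in_edges = {"b": [("a", 0.5)], "c": [("b", 0.5), ("a", 0.1)]}
print(random_rr_set(in_edges, ["a", "b", "c"], random.Random(1)))
```

A seed set's expected influence is proportional to the fraction of such RR sets it intersects, which is why the cost of generating them dominates these algorithms.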
Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Lei Chen, Ke Xu
There has been a dramatic growth of shared mobility applications such as ride-sharing, food delivery, and crowdsourced parcel delivery. Shared mobility refers to transportation services that are shared among users, where a central issue is route planning. Given a set of workers and requests, route planning finds for each worker a route, i.e., a sequence of locations to pick up and drop off passengers/parcels that arrive from time to time, with different optimization objectives. Previous studies lack practicability due to their conflicting objectives and their inefficiency in inserting a new request into a route, a basic operation called insertion. In addition, previous route planning solutions fail to exploit the appearance patterns of future requests hidden in historical data for optimization. In this paper, we present a unified formulation of route planning called URPSM. It has a well-defined parameterized objective function which eliminates the contradictory objectives in previous studies and enables flexible multi-objective route planning for shared mobility. We propose two insertion-based frameworks to solve the URPSM problem. The first is built upon the plain-insertion widely used in prior studies, which processes online requests only, whereas the second relies on a new insertion operator called prophet-insertion that handles both online and predicted requests. Novel dynamic programming algorithms are designed to accelerate both insertions to run in linear time. Theoretical analysis shows that no online algorithm can have a constant competitive ratio for the URPSM problem under the competitive analysis model, yet our prophet-insertion-based framework can achieve a constant optimality ratio under the instance-optimality model. Extensive experimental results on real datasets show that our insertion-based solutions outperform the state-of-the-art algorithms in both effectiveness and efficiency by a large margin (e.g., up to 30× more effective in the objective and up to 20× faster).
{"title":"Unified Route Planning for Shared Mobility: An Insertion-based Framework","authors":"Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Lei Chen, Ke Xu","doi":"10.1145/3488723","DOIUrl":"https://doi.org/10.1145/3488723","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"6 1","pages":"1 - 48"},"publicationDate":"2022-03-31","publicationTypes":"Journal Article"}
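The insertion operation named in the abstract above can be sketched in its naive form: try every pair of positions for the new pickup and drop-off, keep the existing stops in order, and choose the pair that adds the least travel distance. This brute-force version takes cubic time in the route length and is shown only as the baseline that the paper's dynamic programming is reported to speed up to linear time. It assumes Euclidean distances and ignores capacity and deadline constraints; all function names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) locations."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_length(route):
    return sum(dist(route[i], route[i + 1]) for i in range(len(route) - 1))

def best_insertion(route, pickup, dropoff):
    """Return (new_route, added_distance) for the cheapest feasible insertion.

    route[0] is assumed to be the worker's current location, so nothing is
    inserted before it. The pickup always precedes the drop-off.
    """
    base = route_length(route)
    best = None
    for i in range(1, len(route) + 1):      # pickup is placed before route[i]
        for j in range(i, len(route) + 1):  # drop-off is placed before route[j], after the pickup
            candidate = route[:i] + [pickup] + route[i:j] + [dropoff] + route[j:]
            added = route_length(candidate) - base
            if best is None or added < best[1]:
                best = (candidate, added)
    return best

route = [(0, 0), (10, 0)]
print(best_insertion(route, pickup=(2, 0), dropoff=(5, 0)))
```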
Gerald Berger, G. Gottlob, Andreas Pieris, Emanuel Sallinger
Vadalog is a system for performing complex reasoning tasks such as those required in advanced knowledge graphs. The logical core of the underlying Vadalog language is the warded fragment of tuple-generating dependencies (TGDs). This formalism ensures tractable reasoning in data complexity, while a recent analysis focusing on a practical implementation led to the reasoning algorithm around which the Vadalog system is built. A fundamental question that has emerged in the context of Vadalog is whether we can limit the recursion allowed by wardedness in order to obtain a formalism that provides a convenient syntax for expressing useful recursive statements, and at the same time achieves space-efficiency. After analyzing several real-life examples of warded sets of TGDs provided by our industrial partners, as well as recent benchmarks, we observed that recursion is often used in a restricted way: the body of a TGD contains at most one atom whose predicate is mutually recursive with a predicate in the head. We show that this type of recursion, known as piece-wise linear in the Datalog literature, is the answer to our main question. We further show that piece-wise linear recursion alone, without the wardedness condition, is not enough as it leads to undecidability. We also study the relative expressiveness of the query languages based on (piece-wise linear) warded sets of TGDs. Finally, we give preliminary experimental evidence for the practical effect of piece-wise linearity on Vadalog.
{"title":"The Space-Efficient Core of Vadalog","authors":"Gerald Berger, G. Gottlob, Andreas Pieris, Emanuel Sallinger","doi":"10.1145/3488720","DOIUrl":"https://doi.org/10.1145/3488720","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"18 1","pages":"1 - 46"},"publicationDate":"2022-03-31","publicationTypes":"Journal Article"}
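The piece-wise linearity condition stated in the abstract above (the body of a rule contains at most one atom whose predicate is mutually recursive with a predicate in the head) can be checked syntactically. The sketch below does so for rules abstracted to their predicate names; it is an illustrative reading of that one sentence, not code from the Vadalog system, and it does not check wardedness. The rule encoding and the name `is_piecewise_linear` are assumptions.

```python
def is_piecewise_linear(rules):
    """Check the piece-wise linearity condition on a set of rules.

    rules: list of (head_predicates, body_predicates); each is a list of
    predicate names, with one body entry per body atom.
    """
    preds = {p for heads, body in rules for p in list(heads) + list(body)}
    # reach[p] = predicates that depend on p through one or more rule applications.
    reach = {p: set() for p in preds}
    for heads, body in rules:
        for b in body:
            reach[b].update(heads)
    changed = True
    while changed:  # naive transitive closure, fine for small rule sets
        changed = False
        for p in preds:
            new = set().union(*(reach[q] for q in reach[p])) - reach[p]
            if new:
                reach[p] |= new
                changed = True

    def mutually_recursive(p, q):
        return q in reach[p] and p in reach[q]

    return all(
        sum(1 for b in body if any(mutually_recursive(b, h) for h in heads)) <= 1
        for heads, body in rules
    )

# T(x,y) :- E(x,y).   T(x,z) :- E(x,y), T(y,z).
linear_tc = [(["T"], ["E"]), (["T"], ["E", "T"])]
# T(x,z) :- T(x,y), T(y,z).
nonlinear_tc = [(["T"], ["E"]), (["T"], ["T", "T"])]
print(is_piecewise_linear(linear_tc), is_piecewise_linear(nonlinear_tc))
```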
Robert Binna, Eva Zangerle, M. Pichl, Günther Specht, Viktor Leis
We present the Height Optimized Trie (HOT), a fast and space-efficient in-memory index structure. The core algorithmic idea of HOT is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. For a fixed maximum node fanout, the overall tree height is minimal and its structure is deterministically defined. Multiple carefully engineered node implementations using SIMD instructions or lightweight compression schemes provide compactness and fast search and optimize HOT structures for different usage scenarios. Our experiments, which use a wide variety of workloads and data sets, show that HOT outperforms other state-of-the-art index structures for string keys both in terms of search performance and memory footprint, while being competitive for integer keys.
{"title":"Height Optimized Tries","authors":"Robert Binna, Eva Zangerle, M. Pichl, Günther Specht, Viktor Leis","doi":"10.1145/3506692","DOIUrl":"https://doi.org/10.1145/3506692","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 46"},"publicationDate":"2022-03-31","publicationTypes":"Journal Article"}
Ji Cheng, Da Yan, Wenwen Qu, Xiaotian Hao, Cheng Long, Wilfred Ng, Xiaoling Wang
Given a data matrix D, a submatrix S of D is an order-preserving submatrix (OPSM) if there is a permutation of the columns of S under which the entry values of each row in S are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preferences. However, noise is ubiquitous in real data matrices due to variable experimental conditions and measurement errors, which makes conventional OPSM mining algorithms inapplicable. No previous work on OPSM has ever considered uncertain value intervals using the well-established possible world semantics. We establish two different definitions of significant OPSMs based on the possible world semantics: (1) expected support-based and (2) probabilistic frequentness-based. An optimized dynamic programming approach is proposed to compute the probability that a row supports a particular column permutation, with a closed-form formula derived to efficiently handle the special case of uniform value distribution and an accurate cubic spline approximation approach that works well with any uncertain value distribution. To check the probabilistic frequentness efficiently, several effective pruning rules are designed to prune insignificant OPSMs; two approximation techniques based on the Poisson and Gaussian distributions, respectively, are proposed for further speedup. These techniques are integrated into our two OPSM mining algorithms, based on prefix-projection and Apriori, respectively. We further parallelize our prefix-projection-based mining algorithm using PrefixFPM, a recently proposed framework for parallel frequent pattern mining, and we achieve a good speedup with the number of CPU cores. Extensive experiments on real microarray data demonstrate that the OPSMs found by our algorithms have a much higher quality than those found by existing approaches.
{"title":"Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation Methods","authors":"Ji Cheng, Da Yan, Wenwen Qu, Xiaotian Hao, Cheng Long, Wilfred Ng, Xiaoling Wang","doi":"10.1145/3524915","DOIUrl":"https://doi.org/10.1145/3524915","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"25 1","pages":"1 - 57"},"publicationDate":"2022-03-31","publicationTypes":"Journal Article"}
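As a rough illustration of the OPSM definitions in the abstract above, the following Python sketch checks whether a set of rows forms an OPSM under a given column permutation, and estimates the probability that a row with uncertain values supports a permutation by sampling possible worlds. The function names, the assumption of independent uniform intervals, and the Monte Carlo estimator are all illustrative assumptions; the paper's dynamic programming method, closed-form formula, and spline approximation are not reproduced here.

```python
import random


def supports(row, perm):
    """True if the row's values are strictly increasing when its
    columns are read in the order given by `perm`."""
    return all(row[a] < row[b] for a, b in zip(perm, perm[1:]))


def is_opsm(matrix, rows, perm):
    """True if every selected row supports the column permutation."""
    return all(supports(matrix[r], perm) for r in rows)


def support_probability(intervals, perm, samples=20000, seed=0):
    """Monte Carlo estimate of P(row supports perm), assuming each entry
    is drawn independently and uniformly from its (lo, hi) interval.
    This is a sampling stand-in, not the paper's exact computation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # One sampled "possible world" for this row.
        world = [rng.uniform(lo, hi) for lo, hi in intervals]
        hits += supports(world, perm)
    return hits / samples


def expected_support(uncertain_matrix, perm, **kwargs):
    """Expected number of supporting rows: by linearity of expectation,
    the sum of the per-row support probabilities."""
    return sum(support_probability(row, perm, **kwargs)
               for row in uncertain_matrix)
```

For example, with `D = [[1, 5, 3], [2, 9, 4], [7, 1, 2]]`, rows 0 and 1 form an OPSM under the column order `[0, 2, 1]`, while row 2 does not support that order.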
W. Fan, Chao Tian
Given a query \(Q \in \mathcal{Q}\), a graph G, the answers Q(G) to Q in G, and updates ΔG to G as input, the incremental problem for a class \(\mathcal{Q}\) of graph queries is to compute the changes ΔO to the output Q(G) such that Q(G⊕ΔG) = Q(G)⊕ΔO. The problem is called bounded if its cost can be expressed as a polynomial function in the sizes of Q, ΔG and ΔO, which reduces the computations on a possibly big G to the small ΔG and ΔO. No matter how desirable, however, our first results are negative: For common graph queries such as traversal, connectivity, keyword search, pattern matching, and maximum cardinality matching, the incremental problems are unbounded. In light of the negative results, we propose two characterizations for the effectiveness of incremental graph computation: (a) localizable, if its cost is decided by small neighborhoods of the nodes in ΔG instead of the entire G; and (b) bounded relative to a batch graph algorithm \(\mathcal{T}\), if the cost is determined by the sizes of ΔG and of the changes to the affected area that is necessarily checked by any algorithm that incrementalizes \(\mathcal{T}\). We show that the incremental computations above are either localizable or relatively bounded by providing corresponding incremental algorithms. That is, we can either reduce the incremental computations on big graphs to small data, or incrementalize existing batch graph algorithms by minimizing unnecessary recomputation. Using real-life and synthetic data, we experimentally verify the effectiveness of our incremental algorithms.
{"title":"Incremental Graph Computations: Doable and Undoable","authors":"W. Fan, Chao Tian","doi":"10.1145/3500930","DOIUrl":"https://doi.org/10.1145/3500930","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"5 1","pages":"1 - 44"},"publicationDate":"2022-03-10","publicationTypes":"Journal Article"}
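The identity Q(G⊕ΔG) = Q(G)⊕ΔO from the abstract above can be made concrete with a small Python sketch for one simple query: single-source reachability under edge insertions only. The functions `reachable` and `insert_edges` are hypothetical names chosen for this illustration, and the restriction to insertions is an assumption; this is not one of the paper's algorithms, and it says nothing about the unboundedness results, which concern general updates.

```python
from collections import deque


def reachable(adj, source):
    """Batch algorithm: Q(G), the set of nodes reachable from `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def insert_edges(adj, answer, new_edges):
    """Apply the insertions in `new_edges` (the update) to `adj` in place
    and return the output change: nodes that become reachable.
    `answer` must be the reachable set before the update."""
    for u, v in new_edges:
        adj.setdefault(u, set()).add(v)

    delta = set()
    queue = deque()
    # Only inserted edges leaving the old answer can add new nodes.
    for u, v in new_edges:
        if u in answer and v not in answer and v not in delta:
            delta.add(v)
            queue.append(v)
    # Explore only from newly reachable nodes; the old answer is never rescanned.
    while queue:
        x = queue.popleft()
        for w in adj.get(x, ()):
            if w not in answer and w not in delta:
                delta.add(w)
                queue.append(w)
    return delta
```

Here the union of the old answer and the returned set equals what the batch algorithm would compute on the updated graph, and the traversal only visits newly reachable nodes and their out-edges.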
Diego Arroyuelo, G. Navarro, Juan L. Reutter, J. Rojas-Ledesma
Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now count several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality one either needs to build completely new indexes or must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that is typically one or two orders of magnitude more than what is required to store the raw data. We show that worst-case optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need of any significant extra storage. Our representation is a compressed quadtree for the static indexes and a quadtree built on the fly that shares subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, which simulates navigation of the quadtree of the output, and show that the running time of this algorithm is worst-case optimal in data complexity. We implement our index and compare it experimentally with state-of-the-art alternatives. Our experiments show that our index uses even less space than what is needed to store the data in raw form (and replaces it) and one or two orders of magnitude less space than the other indexes. At the same time, our query algorithm is competitive in time, even sharply outperforming other indexes in various cases. Finally, we extend our framework to evaluate more expressive queries from relational algebra, including not only joins and intersections but also unions and negations. To obtain optimality on those more complex formulas, we introduce a lazy version of qdags, which we dub lqdags, that allows us to navigate over the quadtree representing the output of a formula while only evaluating what is needed from its components. We show that the running time of our query algorithms on this extended set of operations is worst-case optimal under some constraints. Moving to full relational algebra, we also show that lqdags can handle selections and projections. While worst-case optimality is no longer guaranteed, we introduce a partial materialization scheme that extends results from Deep and Koutris regarding compressed representation of query results.
{"title":"Optimal Joins Using Compressed Quadtrees","authors":"Diego Arroyuelo, G. Navarro, Juan L. Reutter, J. Rojas-Ledesma","doi":"10.1145/3514231","DOIUrl":"https://doi.org/10.1145/3514231","journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"31 1","pages":"1 - 53"},"publicationDate":"2022-02-23","publicationTypes":"Journal Article"}
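To give a feel for the idea in the abstract above of treating relations as point sets in a grid, the following Python sketch builds a plain pointer-based quadtree for a binary relation and intersects two such trees by descending both at once, skipping any quadrant that is empty in either input. The names `build`, `intersect`, and `points_of`, the node encoding, and the restriction to two relations over the same two attributes are assumptions made for this illustration; the paper's compressed quadtrees, qdags, and lqdags are considerably more involved and are not implemented here.

```python
def build(points, size, x0=0, y0=0):
    """Quadtree over a size x size grid (size is a power of two).
    Encoding: None = empty region, True = one occupied cell,
    otherwise a 4-tuple of child quadrants."""
    if not points:
        return None
    if size == 1:
        return True
    half = size // 2
    children = []
    for dx in (0, half):
        for dy in (0, half):
            sub = [(x, y) for x, y in points
                   if x0 + dx <= x < x0 + dx + half
                   and y0 + dy <= y < y0 + dy + half]
            children.append(build(sub, half, x0 + dx, y0 + dy))
    return tuple(children)


def intersect(a, b):
    """Intersect two quadtrees built over the same grid, pruning any
    quadrant that is empty in either input."""
    if a is None or b is None:
        return None
    if a is True:  # same depth, so b is an occupied cell as well
        return True
    kids = tuple(intersect(ca, cb) for ca, cb in zip(a, b))
    return kids if any(k is not None for k in kids) else None


def points_of(node, size, x0=0, y0=0):
    """Enumerate the points stored in a quadtree."""
    if node is None:
        return []
    if node is True:
        return [(x0, y0)]
    half = size // 2
    offsets = [(dx, dy) for dx in (0, half) for dy in (0, half)]
    out = []
    for child, (dx, dy) in zip(node, offsets):
        out.extend(points_of(child, half, x0 + dx, y0 + dy))
    return out
```

The pruning in `intersect` is the part that matters: a whole quadrant is discarded as soon as one operand has nothing in it, so work is only spent where both relations have data.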