
ACM Transactions on Database Systems (TODS): Latest Publications

An Empirical Study of Moment Estimators for Quantile Approximation
Pub Date : 2021-03-18 DOI: 10.1145/3442337
Rory Mitchell, E. Frank, G. Holmes
We empirically evaluate lightweight moment estimators for the single-pass quantile approximation problem, including maximum entropy methods and orthogonal series with Fourier, Cosine, Legendre, Chebyshev, and Hermite basis functions. We show how to apply stable summation formulas to offset numerical precision issues for higher-order moments, leading to reliable single-pass moment estimators up to order 15. Additionally, we provide an algorithm for GPU-accelerated quantile approximation based on parallel tree reduction. Experiments evaluate the accuracy and runtime of moment estimators against the state-of-the-art KLL quantile estimator on 14,072 real-world datasets drawn from the OpenML database. Our analysis highlights the effectiveness of variants of moment-based quantile approximation for highly space-efficient summaries: their average performance using as few as five sample moments can approach the performance of a KLL sketch containing 500 elements. Experiments also illustrate the difficulty of applying the method reliably and showcase which moment-based approximations can be expected to fail or perform poorly.
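As a hedged illustration of the single-pass setting (not the paper's estimators: this toy sketch keeps only the first two moments via Welford's numerically stable update and fits a Gaussian, whereas the paper reconstructs quantiles from up to 15 moments using maximum-entropy and orthogonal-series methods):

```python
import random
from statistics import NormalDist

def stream_moments(xs):
    """One pass over the stream: Welford's stable update for the
    running mean and the sum of squared deviations."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    var = m2 / n if n else 0.0
    return n, mean, var

def approx_quantile(mean, var, q):
    """Moment-based quantile estimate under a Gaussian model fitted
    to the first two moments (a crude stand-in for the paper's
    maximum-entropy reconstruction)."""
    return NormalDist(mean, var ** 0.5).inv_cdf(q)

random.seed(42)
data = [random.gauss(10.0, 2.0) for _ in range(100_000)]
n, mean, var = stream_moments(data)
est = approx_quantile(mean, var, 0.5)  # median estimate
```

With Gaussian input the two-moment model is exact in the limit; for skewed data it degrades, which is precisely why higher-order moments (and the failure modes the paper catalogs) matter.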
Citations: 5
Evaluation of Machine Learning Algorithms in Predicting the Next SQL Query from the Future
Pub Date : 2021-03-18 DOI: 10.1145/3442338
Venkata Vamsikrishna Meduri, Kanchan Chowdhury, Mohamed Sarwat
Prediction of the next SQL query from the user, given her sequence of queries until the current timestep, during an ongoing interaction session of the user with the database, can help in speculative query processing and increased interactivity. While existing machine learning (ML)-based approaches use recommender systems to suggest relevant queries to a user, there has been no exhaustive study on applying temporal predictors to predict the next user-issued query. In this work, we experimentally compare ML algorithms in predicting the immediate next future query in an interaction workload, given the current user query or the sequence of queries in a user session thus far. As a part of this, we propose the adaptation of two powerful temporal predictors: (a) Recurrent Neural Networks (RNNs) and (b) a Reinforcement Learning approach called Q-Learning that uses Markov Decision Processes. We represent each query as a comprehensive set of fragment embeddings that not only captures the SQL operators, attributes, and relations but also the arithmetic comparison operators and constants that occur in the query. Our experiments on two real-world datasets show the effectiveness of temporal predictors against the baseline recommender systems in predicting the structural fragments in a query w.r.t. both quality and time. Besides showing that RNNs can be used to synthesize novel queries, we find that exact Q-Learning outperforms RNNs despite predicting the next query entirely from the historical query logs.
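The "predict entirely from historical query logs" idea can be sketched, in a much-simplified form, as a first-order Markov model over query templates (the session log and template names below are invented for illustration; the paper's RNN and Q-Learning predictors are far richer):

```python
from collections import Counter, defaultdict

def train_transitions(sessions):
    """Count template-to-template transitions across user sessions:
    a first-order Markov model mined from the historical query log."""
    trans = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            trans[cur][nxt] += 1
    return trans

def predict_next(trans, current):
    """Predict the most frequent historical successor of `current`,
    or None if the template was never followed by anything."""
    if current not in trans:
        return None
    return trans[current].most_common(1)[0][0]

# Hypothetical interaction workload: each list is one user session.
log = [
    ["SELECT_users", "SELECT_orders", "AGG_orders"],
    ["SELECT_users", "SELECT_orders", "JOIN_users_orders"],
    ["SELECT_users", "SELECT_orders", "AGG_orders"],
]
model = train_transitions(log)
```

A speculative query processor could warm caches or pre-plan for `predict_next(model, current_template)` while the user is still composing.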
Citations: 3
Sampling a Near Neighbor in High Dimensions — Who is the Fairest of Them All?
Pub Date : 2021-01-26 DOI: 10.1145/3502867
Martin Aumuller, Sariel Har-Peled, S. Mahabadi, R. Pagh, Francesco Silvestri
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality-sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH-based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the unfairness of state-of-the-art NN data structures and shows the performance of our algorithms on real-world datasets.
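The building block the abstract mentions, sampling uniformly from a sub-collection of possibly overlapping sets, can be sketched with multiplicity-based rejection sampling (a standard trick, not the paper's exact data structures; the buckets below are hypothetical stand-ins for LSH candidate buckets):

```python
import random
from collections import Counter

def uniform_from_union(buckets, rng):
    """Sample uniformly from the union of `buckets` without
    materializing it: pick a bucket with probability proportional to
    its size, pick a member uniformly, then accept with probability
    1/multiplicity. Per round, every element of the union is accepted
    with the same probability 1/|union of sizes|, so accepted draws
    are uniform over the union."""
    multiplicity = Counter(x for b in buckets for x in set(b))
    sizes = [len(b) for b in buckets]
    while True:
        b = rng.choices(buckets, weights=sizes)[0]
        x = rng.choice(b)
        if rng.random() < 1.0 / multiplicity[x]:
            return x

rng = random.Random(0)
# "c" appears in three buckets, yet must not be over-sampled.
buckets = [["a", "b", "c"], ["b", "c", "d"], ["c", "d"]]
draws = Counter(uniform_from_union(buckets, rng) for _ in range(20000))
```

Without the rejection step, heavily replicated points (those falling in many buckets) would be returned disproportionately often, which is exactly the unfairness the paper measures in standard LSH.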
Citations: 10
Flexible Skylines
Pub Date : 2020-12-10 DOI: 10.1145/3406113
P. Ciaccia, D. Martinenghi
Skyline and ranking queries are two popular, alternative ways of discovering interesting data in large datasets. Skyline queries are simple to specify, as they just return the set of all non-dominated tuples, thereby providing an overall view of potentially interesting results. However, they are not equipped with any means to accommodate user preferences or to control the cardinality of the result set. Ranking queries adopt, instead, a specific scoring function to rank tuples, and can easily control the output size. While specifying a scoring function allows one to give different importance to different attributes by means of, e.g., weight parameters, choosing the “right” weights to use is known to be a hard problem. In this article, we embrace the skyline approach by introducing an original framework able to capture user preferences by means of constraints on the weights used in a scoring function, which is typically much easier than specifying precise weight values. To this end, we introduce the novel concept of F-dominance, i.e., dominance with respect to a family of scoring functions F: a tuple t is said to F-dominate tuple s when t is always better than or equal to s according to all the functions in F. Based on F-dominance, we present two flexible skyline (F-skyline) operators, both returning a subset of the skyline: nd, characterizing the set of non-F-dominated tuples; po, referring to the tuples that are also potentially optimal, i.e., best according to some function in F. While nd and po coincide and reduce to the traditional skyline when F is the family of all monotone scoring functions, their behaviors differ when subsets thereof are considered. We discuss the formal properties of these new operators, show how to implement them efficiently, and evaluate them on both synthetic and real datasets.
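For linear scoring in two dimensions, F-dominance under an interval constraint on the weight reduces to checking the interval endpoints, because the score difference between two tuples is linear in w. A minimal sketch of the nd operator under that assumption (lower scores are better; the data and weight bounds are invented for illustration):

```python
def f_dominates(t, s, wlo, whi):
    """2-D F-dominance for the family f_w(x) = w*x[0] + (1-w)*x[1]
    with w constrained to [wlo, whi] (lower is better). The score
    difference is linear in w, so checking the endpoints suffices:
    t F-dominates s iff t is never worse and somewhere strictly better."""
    def diff(w):
        return (w * t[0] + (1 - w) * t[1]) - (w * s[0] + (1 - w) * s[1])
    lo, hi = diff(wlo), diff(whi)
    return lo <= 0 and hi <= 0 and (lo < 0 or hi < 0)

def nd(tuples, wlo, whi):
    """The nd operator: tuples not F-dominated by any other tuple."""
    return [t for t in tuples
            if not any(f_dominates(s, t, wlo, whi) for s in tuples if s != t)]

pts = [(1, 5), (2, 2), (5, 1), (4, 4)]
```

With the unconstrained family (w in [0, 1]) nd returns the traditional skyline; narrowing the weight interval shrinks the answer, which is the cardinality control the abstract describes.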
Citations: 2
Incremental and Approximate Computations for Accelerating Deep CNN Inference
Pub Date : 2020-12-06 DOI: 10.1145/3397461
Supun Nakandala, Kabir Nagrecha, Arun Kumar, Y. Papakonstantinou
Deep learning now offers state-of-the-art accuracy for many prediction tasks. A form of deep learning called deep convolutional neural networks (CNNs) is especially popular on image, video, and time series data. Due to its high computational cost, CNN inference is often a bottleneck in analytics tasks on such data. Thus, much work in the computer architecture, systems, and compilers communities studies how to make CNN inference faster. In this work, we show that by elevating the abstraction level and re-imagining CNN inference as queries, we can bring to bear database-style query optimization techniques to improve CNN inference efficiency. We focus on tasks that perform CNN inference repeatedly on inputs that are only slightly different. We identify two popular CNN tasks with this behavior: occlusion-based explanations (OBE) and object recognition in videos (ORV). OBE is a popular method for “explaining” CNN predictions. It outputs a heatmap over the input to show which regions (e.g., image pixels) mattered most for a given prediction. It leads to many re-inference requests on locally modified inputs. ORV uses CNNs to identify and track objects across video frames. It also leads to many re-inference requests. We cast such tasks in a unified manner as a novel instance of the incremental view maintenance problem and create a comprehensive algebraic framework for incremental CNN inference that reduces computational costs. We produce materialized views of features produced inside a CNN and connect them with a novel multi-query optimization scheme for CNN re-inference. Finally, we also devise novel OBE-specific and ORV-specific approximate inference optimizations exploiting their semantics. We prototype our ideas in Python to create a tool called Krypton that supports both CPUs and GPUs.
Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5× (respectively, 35×) to produce exact (respectively, high-quality approximate) results without raising resource requirements.
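The core reuse idea, treating re-inference on a locally modified input as view maintenance, can be sketched for a single convolution layer: when one input pixel changes, only the outputs whose receptive field covers it need a delta update. (A toy pure-Python sketch of the idea, not Krypton's actual framework.)

```python
def conv2d(img, k):
    """Plain 'valid' 2-D cross-correlation over nested lists."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + di][j + dj] * k[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(ow)] for i in range(oh)]

def update_pixel(img, k, out, r, c, val):
    """Incrementally maintain out == conv2d(img, k) when img[r][c]
    changes to val: each affected output cell (i, j) shifts by
    delta * k[r - i][c - j]; all other cells are untouched."""
    delta = val - img[r][c]
    img[r][c] = val
    kh, kw = len(k), len(k[0])
    for i in range(max(0, r - kh + 1), min(len(out) - 1, r) + 1):
        for j in range(max(0, c - kw + 1), min(len(out[0]) - 1, c) + 1):
            out[i][j] += delta * k[r - i][c - j]
    return out

img = [[(i * 5 + j) % 7 for j in range(5)] for i in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]  # Sobel-like filter
out = conv2d(img, kernel)
update_pixel(img, kernel, out, 2, 3, 9)  # "occlude" one pixel
```

For a 3x3 kernel at most 9 output cells are patched per changed pixel, versus recomputing every output window in a full pass, which is the asymmetry OBE-style workloads exploit.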
Citations: 18
MobilityDB
Pub Date : 2020-12-06 DOI: 10.1145/3406534
E. Zimányi, M. Sakr, Arthur Lesuisse
Despite two decades of research in moving object databases and a few research prototypes that have been proposed, there is not yet a mainstream system targeted for industrial use. In this article, we present MobilityDB, a moving object database that extends the type system of PostgreSQL and PostGIS with abstract data types for representing moving object data. The types are fully integrated into the platform to reuse its powerful data management features. Furthermore, MobilityDB builds on existing operations, indexing, aggregation, and optimization framework. This is all made accessible via the SQL query interface.
Citations: 31
Discovering Graph Functional Dependencies
Pub Date : 2020-09-11 DOI: 10.1145/3397198
W. Fan, Chunming Hu, Xueli Liu, Ping Lu
This article studies discovery of Graph Functional Dependencies (GFDs), a class of functional dependencies defined on graphs. We investigate the fixed-parameter tractability of three fundamental problems related to GFD discovery. We show that the implication and satisfiability problems are fixed-parameter tractable, but the validation problem is co-W[1]-hard in general. We introduce notions of reduced GFDs and their topological support, and formalize the discovery problem for GFDs. We develop algorithms for discovering GFDs and computing their covers. Moreover, we show that GFD discovery is feasible over large-scale graphs, by providing parallel scalable algorithms that guarantee to reduce running time when more processors are used. Using real-life and synthetic data, we experimentally verify the effectiveness and scalability of the algorithms.
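Stripped of its graph-pattern side, the attribute constraint in a GFD is a functional dependency over the tuples matched by the pattern. A minimal sketch of validating such a dependency (the matched tuples below are invented; real GFD validation must also solve the pattern-matching step, which is where the co-W[1]-hardness arises):

```python
def holds(tuples, lhs, rhs):
    """Check a functional dependency lhs -> rhs over attribute dicts:
    any two tuples that agree on all lhs attributes must also agree
    on all rhs attributes."""
    seen = {}
    for t in tuples:
        key = tuple(t.get(a) for a in lhs)
        val = tuple(t.get(a) for a in rhs)
        if key in seen and seen[key] != val:
            return False  # found a violating pair
        seen[key] = val
    return True

# Hypothetical attribute tuples of nodes matched by some graph pattern.
matches = [
    {"country": "NL", "capital": "Amsterdam"},
    {"country": "FR", "capital": "Paris"},
    {"country": "NL", "capital": "Amsterdam"},
]
```

Discovery, as studied in the article, goes the other way: enumerate candidate (pattern, dependency) pairs and keep the reduced ones that hold on the graph with sufficient topological support.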
Citations: 12
Packing R-trees with Space-filling Curves
Pub Date : 2020-08-26 DOI: 10.1145/3397506
Jianzhong Qi, Yufei Tao, Yanchuan Chang, Rui Zhang
The massive amount of data and large variety of data distributions in the big data era call for access methods that are efficient in both query processing and index management, and over both practical and worst-case workloads. To address this need, we revisit two classic multidimensional access methods—the R-tree and the space-filling curve. We propose a novel R-tree packing strategy based on space-filling curves. This strategy produces R-trees with an asymptotically optimal I/O complexity for window queries in the worst case. Experiments show that our R-trees are highly efficient in querying both real and synthetic data of different distributions. The proposed strategy is also simple to parallelize, since it relies only on sorting. We propose a parallel algorithm for R-tree bulk-loading based on the proposed packing strategy and analyze its performance under the massively parallel communication model. To handle dynamic data updates, we further propose index update algorithms that process data insertions and deletions without compromising the optimal query I/O complexity. Experimental results confirm the effectiveness and efficiency of the proposed R-tree bulk-loading and updating algorithms over large data sets.
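A hedged sketch of the sort-based packing idea using Z-order (Morton) keys; the paper's actual strategy and its worst-case window-query guarantees are more involved:

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y (x in the even positions) to
    obtain a Z-order space-filling-curve key for a 2-D point."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def pack_rtree_leaves(points, capacity):
    """Bulk-load leaf pages: sort points by their space-filling-curve
    key, then cut the sorted run into full pages of `capacity` entries.
    Curve locality makes each page spatially compact."""
    pts = sorted(points, key=lambda p: morton_key(*p))
    return [pts[i:i + capacity] for i in range(0, len(pts), capacity)]

leaves = pack_rtree_leaves([(0, 0), (3, 3), (1, 0), (0, 1)], capacity=2)
```

Because the bulk of the work is one sort, the strategy parallelizes as easily as parallel sorting does, which is the property the paper's massively parallel algorithm builds on.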
Citations: 15
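The record above describes packing an R-tree by sorting entries along a space-filling curve and cutting the sorted sequence into full leaf pages. A minimal sketch of that general idea in Python, using a Z-order (Morton) curve over 2D integer points; this illustrates curve-based packing only and is not the paper's exact algorithm, and all function names here are illustrative:

```python
# Minimal sketch of space-filling-curve packing for R-tree leaves:
# sort points by their Morton (Z-order) code, then cut the sorted
# sequence into capacity-sized leaves and compute each leaf's MBR.

def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y to form a Morton (Z-order) code:
    x occupies the even bit positions, y the odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def pack_leaves(points, capacity):
    """Sort points along the Z-order curve, then pack consecutive runs
    of at most `capacity` points into leaves. Each leaf is returned as
    (mbr_low, mbr_high, members), where the two corners give the
    leaf's minimum bounding rectangle."""
    ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
    leaves = []
    for i in range(0, len(ordered), capacity):
        group = ordered[i:i + capacity]
        xs = [p[0] for p in group]
        ys = [p[1] for p in group]
        leaves.append(((min(xs), min(ys)), (max(xs), max(ys)), group))
    return leaves
```

Because packing reduces to a sort plus a linear scan, it parallelizes easily, which matches the abstract's observation that the strategy "relies only on sorting".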
Synthesis of Incremental Linear Algebra Programs 增量线性代数程序的综合
Pub Date : 2020-08-26 DOI: 10.1145/3385398
A. Shaikhha, Mohammed Elseidy, Stephan Mihaila, Daniel Espino, Christoph E. Koch
This article targets the Incremental View Maintenance (IVM) of sophisticated analytics (such as statistical models, machine learning programs, and graph algorithms) expressed as linear algebra programs. We present LAGO, a unified framework for linear algebra that automatically synthesizes efficient incremental trigger programs, thereby freeing the user from error-prone manual derivations, performance tuning, and low-level implementation details. The key technique underlying our framework is abstract interpretation, which is used to infer various properties of analytical programs. These properties give the reasoning power required for the automatic synthesis of efficient incremental triggers. We evaluate the effectiveness of our framework on a wide range of applications from regression models to graph computations.
本文的目标是用线性代数程序表示的复杂分析(如统计模型、机器学习程序和图算法)的增量视图维护(IVM)。我们提出了LAGO,一个用于线性代数的统一框架,可以自动合成高效的增量触发程序,从而将用户从容易出错的手动推导、性能调优和低级实现细节中解放出来。我们的框架的关键技术是抽象解释,它用于推断分析程序的各种属性。这些属性提供了自动合成有效增量触发器所需的推理能力。我们评估了我们的框架在从回归模型到图计算的广泛应用中的有效性。
{"title":"Synthesis of Incremental Linear Algebra Programs","authors":"A. Shaikhha, Mohammed Elseidy, Stephan Mihaila, Daniel Espino, Christoph E. Koch","doi":"10.1145/3385398","DOIUrl":"https://doi.org/10.1145/3385398","url":null,"abstract":"This article targets the Incremental View Maintenance (IVM) of sophisticated analytics (such as statistical models, machine learning programs, and graph algorithms) expressed as linear algebra programs. We present LAGO, a unified framework for linear algebra that automatically synthesizes efficient incremental trigger programs, thereby freeing the user from error-prone manual derivations, performance tuning, and low-level implementation details. The key technique underlying our framework is abstract interpretation, which is used to infer various properties of analytical programs. These properties give the reasoning power required for the automatic synthesis of efficient incremental triggers. We evaluate the effectiveness of our framework on a wide range of applications from regression models to graph computations.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"1 1","pages":"1 - 44"},"PeriodicalIF":0.0,"publicationDate":"2020-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88774707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
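The core of incremental view maintenance for linear algebra can be shown in its simplest form: a view C = A·B refreshed under an additive update to A using the identity (A + ΔA)·B = A·B + ΔA·B, so only the (often sparse) delta is multiplied. A hand-rolled sketch of such an incremental trigger; the matrix helpers are illustrative and are not LAGO's API:

```python
# Sketch of an incremental trigger for the view C = A @ B under the
# update A := A + dA, using (A + dA) @ B = A @ B + dA @ B.
# Matrices are plain lists of row lists.

def matmul(A, B):
    """Naive dense matrix product."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def mat_add(C, D):
    """Elementwise matrix sum."""
    return [[C[i][j] + D[i][j] for j in range(len(C[0]))]
            for i in range(len(C))]

def apply_delta(C, dA, B):
    """Refresh the materialized view C = A @ B incrementally:
    instead of recomputing A @ B from scratch, add dA @ B."""
    return mat_add(C, matmul(dA, B))
```

When dA touches few rows, the trigger costs far less than full recomputation, which is exactly the saving an IVM framework derives automatically from the algebraic identity.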
Efficient Discovery of Matching Dependencies 匹配依赖的有效发现
Pub Date : 2020-08-26 DOI: 10.1145/3392778
P. Schirmer, Thorsten Papenbrock, Ioannis K. Koumarelas, Felix Naumann
Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
匹配依赖项(MDs)是数据分析结果,通常用于数据集成、数据清理和实体匹配。它们是匹配相似元素而不是相同元素的功能依赖关系(fd)的泛化。由于它们的发现非常困难,现有的分析算法要么只能找到所有MDs的一小部分子集,要么它们的范围仅限于小数据集。我们专注于在真实世界的数据集中有效地发现所有有趣的MDs。为此,我们提出了一种新的MD发现算法HyMD,它可以在给定的相似性边界内找到所有最小的非平凡MD。该算法从数据中提取单个MDs的精确相似度阈值,而不是使用预定义的相似度阈值。因此,它是第一个准确、真正完整地解决MD发现问题的方法。但是,如果需要,算法可以在报告的MDs上强制执行某些属性,例如不连接和最小支持,以便将发现集中在下游用例实际需要的结果上。HyMD在技术上是一种混合方法,它结合了相关工作中两种最流行的依赖项发现策略:格遍历和从记录对推断。尽管需要为所有MD候选者寻找精确的相似阈值,但该算法仍然能够有效地处理大型数据集,例如大于3gb的数据集。
{"title":"Efficient Discovery of Matching Dependencies","authors":"P. Schirmer, Thorsten Papenbrock, Ioannis K. Koumarelas, Felix Naumann","doi":"10.1145/3392778","DOIUrl":"https://doi.org/10.1145/3392778","url":null,"abstract":"Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. 
Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"13 1","pages":"1 - 33"},"PeriodicalIF":0.0,"publicationDate":"2020-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90780065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
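What a matching dependency asserts can be made concrete with a toy validator: an MD of the form sim(name) ≥ a → sim(city) ≥ b holds if every record pair that is similar enough on the left-hand attribute is also similar enough on the right-hand one. The sketch below uses `difflib` ratios as the similarity measure and only checks a given MD; HyMD's actual discovery (lattice traversal combined with inference from record pairs, plus threshold extraction) is far more involved, and the function names here are illustrative:

```python
# Toy MD validator: sim(lhs_attr) >= lhs_thr  ->  sim(rhs_attr) >= rhs_thr
# over all record pairs, with difflib's ratio as the similarity measure.
from difflib import SequenceMatcher
from itertools import combinations

def sim(a: str, b: str) -> float:
    """String similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()

def md_holds(records, lhs_attr, lhs_thr, rhs_attr, rhs_thr):
    """Return True iff every record pair that is similar on the LHS
    attribute (>= lhs_thr) is also similar on the RHS attribute
    (>= rhs_thr); a single violating pair falsifies the MD."""
    for r, s in combinations(records, 2):
        if sim(r[lhs_attr], s[lhs_attr]) >= lhs_thr:
            if sim(r[rhs_attr], s[rhs_attr]) < rhs_thr:
                return False
    return True
```

Validating one MD is already quadratic in the number of records, which hints at why discovering all minimal MDs with exact thresholds over gigabyte-scale data is hard.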