The proliferation and ubiquity of temporal data across many disciplines have generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data-mining methods, not only due to its exploratory power but also because it is often a preprocessing step or subroutine for other techniques. In this article, we present k-Shape and k-MultiShapes (k-MS), two novel algorithms for time-series clustering. k-Shape and k-MS rely on a scalable iterative refinement procedure. As their distance measure, k-Shape and k-MS use shape-based distance (SBD), a normalized version of the cross-correlation measure, to consider the shapes of time series while comparing them. Based on the properties of SBD, we develop two new methods, namely ShapeExtraction (SE) and MultiShapesExtraction (MSE), to compute cluster centroids that are used in every iteration to update the assignment of time series to clusters. k-Shape relies on SE to compute a single centroid per cluster based on all time series in each cluster. In contrast, k-MS relies on MSE to compute multiple centroids per cluster to account for the proximity and spatial distribution of time series in each cluster. To demonstrate the robustness of SBD, k-Shape, and k-MS, we perform an extensive experimental evaluation on 85 datasets against state-of-the-art distance measures and clustering methods for time series using rigorous statistical analysis. SBD, our efficient and parameter-free distance measure, achieves similar accuracy to Dynamic Time Warping (DTW), a highly accurate but computationally expensive distance measure that requires parameter tuning. For clustering, we compare k-Shape and k-MS against scalable and non-scalable partitional, hierarchical, spectral, density-based, and shapelet-based methods, with combinations of the most competitive distance measures. k-Shape outperforms all scalable methods in terms of accuracy. Furthermore, k-Shape also outperforms all non-scalable approaches, with one exception, namely k-medoids with DTW, which achieves similar accuracy. However, unlike k-Shape, this approach requires tuning of its distance measure and is significantly slower than k-Shape. k-MS performs similarly to k-Shape in comparison to rival methods, but k-MS is significantly more accurate than k-Shape. Beyond clustering, we demonstrate the effectiveness of k-Shape in reducing the search space of one-nearest-neighbor classifiers for time series. Overall, SBD, k-Shape, and k-MS emerge as domain-independent, highly accurate, and efficient methods for time-series comparison and clustering with broad applications.
{"title":"Fast and Accurate Time-Series Clustering","authors":"John Paparrizos, L. Gravano","doi":"10.1145/3044711","DOIUrl":"https://doi.org/10.1145/3044711","url":null,"abstract":"The proliferation and ubiquity of temporal data across many disciplines has generated substantial interest in the analysis and mining of time series. Clustering is one of the most popular data-mining methods, not only due to its exploratory power but also because it is often a preprocessing step or subroutine for other techniques. In this article, we present k-Shape and k-MultiShapes (k-MS), two novel algorithms for time-series clustering. k-Shape and k-MS rely on a scalable iterative refinement procedure. As their distance measure, k-Shape and k-MS use shape-based distance (SBD), a normalized version of the cross-correlation measure, to consider the shapes of time series while comparing them. Based on the properties of SBD, we develop two new methods, namely ShapeExtraction (SE) and MultiShapesExtraction (MSE), to compute cluster centroids that are used in every iteration to update the assignment of time series to clusters. k-Shape relies on SE to compute a single centroid per cluster based on all time series in each cluster. In contrast, k-MS relies on MSE to compute multiple centroids per cluster to account for the proximity and spatial distribution of time series in each cluster. To demonstrate the robustness of SBD, k-Shape, and k-MS, we perform an extensive experimental evaluation on 85 datasets against state-of-the-art distance measures and clustering methods for time series using rigorous statistical analysis. SBD, our efficient and parameter-free distance measure, achieves similar accuracy to Dynamic Time Warping (DTW), a highly accurate but computationally expensive distance measure that requires parameter tuning. For clustering, we compare k-Shape and k-MS against scalable and non-scalable partitional, hierarchical, spectral, density-based, and shapelet-based methods, with combinations of the most competitive distance measures. k-Shape outperforms all scalable methods in terms of accuracy. Furthermore, k-Shape also outperforms all non-scalable approaches, with one exception, namely k-medoids with DTW, which achieves similar accuracy. However, unlike k-Shape, this approach requires tuning of its distance measure and is significantly slower than k-Shape. k-MS performs similarly to k-Shape in comparison to rival methods, but k-MS is significantly more accurate than k-Shape. Beyond clustering, we demonstrate the effectiveness of k-Shape to reduce the search space of one-nearest-neighbor classifiers for time series. Overall, SBD, k-Shape, and k-MS emerge as domain-independent, highly accurate, and efficient methods for time-series comparison and clustering with broad applications.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"2 1","pages":"1 - 49"},"PeriodicalIF":0.0,"publicationDate":"2017-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84015427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increasingly advanced technologies have become available to collect and integrate an unprecedented amount of data from multiple sources, including GPS trajectories that trace moving objects. Given that GPS trajectories are vast in size while the information they carry can be redundant, we focus on trajectory compression in this article. As a systematic solution, we propose a comprehensive framework, namely, COMPRESS (Comprehensive Paralleled Road-Network-Based Trajectory Compression), to compress GPS trajectory data in an urban road network. In the preprocessing step, COMPRESS decomposes trajectories into spatial paths and temporal sequences, with a thorough justification for trajectory decomposition. In the compression step, COMPRESS performs spatial compression on spatial paths, and temporal compression on temporal sequences in parallel. It introduces two alternative algorithms with different strengths for lossless spatial compression and designs lossy but error-bounded algorithms for temporal compression. It also presents query processing algorithms to support error-bounded location-based queries on compressed trajectories without full decompression. All algorithms under COMPRESS are efficient and have the time complexity of O(|T|), where |T| is the size of the input trajectory T. We have also conducted a comprehensive experimental study to demonstrate the effectiveness of COMPRESS, whose compression ratio is significantly better than that of related approaches.
{"title":"COMPRESS","authors":"Yunheng Han, Weiwei Sun, Baihua Zheng","doi":"10.1145/3015457","DOIUrl":"https://doi.org/10.1145/3015457","url":null,"abstract":"More and more advanced technologies have become available to collect and integrate an unprecedented amount of data from multiple sources, including GPS trajectories about the traces of moving objects. Given the fact that GPS trajectories are vast in size while the information carried by the trajectories could be redundant, we focus on trajectory compression in this article. As a systematic solution, we propose a comprehensive framework, namely, COMPRESS (Comprehensive Paralleled Road-Network-Based Trajectory Compression), to compress GPS trajectory data in an urban road network. In the preprocessing step, COMPRESS decomposes trajectories into spatial paths and temporal sequences, with a thorough justification for trajectory decomposition. In the compression step, COMPRESS performs spatial compression on spatial paths, and temporal compression on temporal sequences in parallel. It introduces two alternative algorithms with different strengths for lossless spatial compression and designs lossy but error-bounded algorithms for temporal compression. It also presents query processing algorithms to support error-bounded location-based queries on compressed trajectories without full decompression. All algorithms under COMPRESS are efficient and have the time complexity of O(|T|), where |T| is the size of the input trajectory T. We have also conducted a comprehensive experimental study to demonstrate the effectiveness of COMPRESS, whose compression ratio is significantly better than related approaches.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"17 1","pages":"1 - 49"},"PeriodicalIF":0.0,"publicationDate":"2017-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72969467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information in many applications, such as mobile wireless systems, social networks, and road networks, is captured by graphs. In many cases, such information is uncertain. We study the problem of querying a probabilistic graph, in which vertices are connected to each other probabilistically. In particular, we examine “source-to-target” queries (ST-queries), such as computing the shortest path between two vertices. The major difference with the deterministic setting is that query answers are enriched with probabilistic annotations. Evaluating ST-queries over probabilistic graphs is #P-hard, as it requires examining an exponential number of “possible worlds”—database instances generated from the probabilistic graph. Existing solutions to the ST-query problem, which sample possible worlds, have two downsides: (i) a possible world can be very large and (ii) many samples are needed for reasonable accuracy. To tackle these issues, we study the ProbTree, a data structure that stores a succinct, or indexed, version of the possible worlds of the graph. Existing ST-query solutions are executed on top of this structure, with the number of samples and sizes of the possible worlds reduced. We examine lossless and lossy methods for generating the ProbTree, which reflect the tradeoff between the accuracy and efficiency of query evaluation. We analyze the correctness and complexity of these approaches. Our extensive experiments on real datasets show that the ProbTree is fast to generate and small in size. It also enhances the accuracy and efficiency of existing ST-query algorithms significantly.
{"title":"An Indexing Framework for Queries on Probabilistic Graphs","authors":"S. Maniu, Reynold Cheng, P. Senellart","doi":"10.1145/3044713","DOIUrl":"https://doi.org/10.1145/3044713","url":null,"abstract":"Information in many applications, such as mobile wireless systems, social networks, and road networks, is captured by graphs. In many cases, such information is uncertain. We study the problem of querying a probabilistic graph, in which vertices are connected to each other probabilistically. In particular, we examine “source-to-target” queries (ST-queries), such as computing the shortest path between two vertices. The major difference with the deterministic setting is that query answers are enriched with probabilistic annotations. Evaluating ST-queries over probabilistic graphs is #P-hard, as it requires examining an exponential number of “possible worlds”—database instances generated from the probabilistic graph. Existing solutions to the ST-query problem, which sample possible worlds, have two downsides: (i) a possible world can be very large and (ii) many samples are needed for reasonable accuracy. To tackle these issues, we study the ProbTree, a data structure that stores a succinct, or indexed, version of the possible worlds of the graph. Existing ST-query solutions are executed on top of this structure, with the number of samples and sizes of the possible worlds reduced. We examine lossless and lossy methods for generating the ProbTree, which reflect the tradeoff between the accuracy and efficiency of query evaluation. We analyze the correctness and complexity of these approaches. Our extensive experiments on real datasets show that the ProbTree is fast to generate and small in size. It also enhances the accuracy and efficiency of existing ST-query algorithms significantly.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"33 1","pages":"1 - 34"},"PeriodicalIF":0.0,"publicationDate":"2017-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78743167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, data examples have been at the core of several different approaches to schema-mapping design. In particular, Gottlob and Senellart introduced a framework for schema-mapping discovery from a single data example, in which the derivation of a schema mapping is cast as an optimization problem. Our goal is to refine and study this framework in more depth. Among other results, we design a polynomial-time log(n)-approximation algorithm for computing optimal schema mappings from a given set of data examples (where n is the combined size of the given data examples) for a restricted class of schema mappings; moreover, we show that this approximation ratio cannot be improved. In addition to the complexity-theoretic results, we implemented the aforementioned log(n)-approximation algorithm and carried out an experimental evaluation in a real-world mapping scenario.
{"title":"Approximation Algorithms for Schema-Mapping Discovery from Data Examples","authors":"B. T. Cate, Phokion G. Kolaitis, Kun Qian, W. Tan","doi":"10.1145/3044712","DOIUrl":"https://doi.org/10.1145/3044712","url":null,"abstract":"In recent years, data examples have been at the core of several different approaches to schema-mapping design. In particular, Gottlob and Senellart introduced a framework for schema-mapping discovery from a single data example, in which the derivation of a schema mapping is cast as an optimization problem. Our goal is to refine and study this framework in more depth. Among other results, we design a polynomial-time log(n)-approximation algorithm for computing optimal schema mappings from a given set of data examples (where n is the combined size of the given data examples) for a restricted class of schema mappings; moreover, we show that this approximation ratio cannot be improved. In addition to the complexity-theoretic results, we implemented the aforementioned log(n)-approximation algorithm and carried out an experimental evaluation in a real-world mapping scenario.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"30 1","pages":"1 - 41"},"PeriodicalIF":0.0,"publicationDate":"2017-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73426842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The detection of abnormal moving objects over high-volume trajectory streams is critical for real-time applications ranging from military surveillance to transportation management. Yet this outlier detection problem, especially along both the spatial and temporal dimensions, remains largely unexplored. In this work, we propose a rich taxonomy of novel classes of neighbor-based trajectory outlier definitions that model the anomalous behavior of moving objects for a large range of real-time applications. Our theoretical analysis and empirical study on two real-world datasets—the Beijing Taxi trajectory data and the Ground Moving Target Indicator data stream—and one generated Moving Objects dataset demonstrate the effectiveness of our taxonomy in capturing different types of abnormal moving objects. Furthermore, we propose a general strategy, called the minimal examination (MEX) framework, for efficiently detecting these new outlier classes. The MEX framework features three core optimization principles, which leverage the spatiotemporal as well as the predictability properties of the neighbor evidence to minimize the detection costs. Based on this foundation, we design algorithms that detect outliers under these new outlier semantics while successfully leveraging our optimization principles. Our comprehensive experimental study demonstrates that our proposed MEX strategy drives the detection costs down 100-fold, into the practical realm, for applications that analyze high-volume trajectory streams in near real time.
{"title":"Outlier Detection over Massive-Scale Trajectory Streams","authors":"Yanwei Yu, Lei Cao, Elke A. Rundensteiner, Qin Wang","doi":"10.1145/3013527","DOIUrl":"https://doi.org/10.1145/3013527","url":null,"abstract":"The detection of abnormal moving objects over high-volume trajectory streams is critical for real-time applications ranging from military surveillance to transportation management. Yet this outlier detection problem, especially along both the spatial and temporal dimensions, remains largely unexplored. In this work, we propose a rich taxonomy of novel classes of neighbor-based trajectory outlier definitions that model the anomalous behavior of moving objects for a large range of real-time applications. Our theoretical analysis and empirical study on two real-world datasets—the Beijing Taxi trajectory data and the Ground Moving Target Indicator data stream—and one generated Moving Objects dataset demonstrate the effectiveness of our taxonomy in effectively capturing different types of abnormal moving objects. Furthermore, we propose a general strategy for efficiently detecting these new outlier classes called the minimal examination (MEX) framework. The MEX framework features three core optimization principles, which leverage spatiotemporal as well as the predictability properties of the neighbor evidence to minimize the detection costs. Based on this foundation, we design algorithms that detect the outliers based on these classes of new outlier semantics that successfully leverage our optimization principles. Our comprehensive experimental study demonstrates that our proposed MEX strategy drives the detection costs 100-fold down into the practical realm for applications that analyze high-volume trajectory streams in near real time.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"59 1","pages":"1 - 33"},"PeriodicalIF":0.0,"publicationDate":"2017-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78869042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.
{"title":"Learning Models over Relational Data Using Sparse Tensors and Functional Dependencies","authors":"Mahmoud Abo Khamis, H. Ngo, X. Nguyen, Dan Olteanu, Maximilian Schleich","doi":"10.1145/3375661","DOIUrl":"https://doi.org/10.1145/3375661","url":null,"abstract":"Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed 24-hour timeout, or encounter internal design limitations.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"40 1","pages":"1 - 66"},"PeriodicalIF":0.0,"publicationDate":"2017-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77247486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We investigate the query evaluation problem for fixed queries over fully dynamic databases, where tuples can be inserted or deleted. The task is to design a dynamic algorithm that immediately reports the new result of a fixed query after every database update. We consider queries in first-order logic (FO) and its extension with modulo-counting quantifiers (FO+MOD) and show that they can be efficiently evaluated under updates, provided that the dynamic database does not exceed a certain degree bound. In particular, we construct a data structure that allows us to answer a Boolean FO+MOD query and to compute the size of the result of a non-Boolean query within constant time after every database update. Furthermore, after every database update, we can update the data structure in constant time such that afterwards we are able to test within constant time for a given tuple whether or not it belongs to the query result, to enumerate all tuples in the new query result, and to enumerate the difference between the old and the new query result with constant delay between the output tuples. The preprocessing time needed to build the data structure is linear in the size of the database. Our results extend earlier work on the evaluation of first-order queries on static databases of bounded degree and rely on an effective Hanf normal form for FO+MOD recently obtained by Heimberg, Kuske, and Schweikardt (LICS 2016).
{"title":"Answering FO+MOD Queries under Updates on Bounded Degree Databases","authors":"Christoph Berkholz, Jens Keppeler, Nicole Schweikardt","doi":"10.1145/3232056","DOIUrl":"https://doi.org/10.1145/3232056","url":null,"abstract":"We investigate the query evaluation problem for fixed queries over fully dynamic databases, where tuples can be inserted or deleted. The task is to design a dynamic algorithm that immediately reports the new result of a fixed query after every database update. We consider queries in first-order logic (FO) and its extension with modulo-counting quantifiers (FO+MOD) and show that they can be efficiently evaluated under updates, provided that the dynamic database does not exceed a certain degree bound. In particular, we construct a data structure that allows us to answer a Boolean FO+MOD query and to compute the size of the result of a non-Boolean query within constant time after every database update. Furthermore, after every database update, we can update the data structure in constant time such that afterwards we are able to test within constant time for a given tuple whether or not it belongs to the query result, to enumerate all tuples in the new query result, and to enumerate the difference between the old and the new query result with constant delay between the output tuples. The preprocessing time needed to build the data structure is linear in the size of the database. Our results extend earlier work on the evaluation of first-order queries on static databases of bounded degree and rely on an effective Hanf normal form for FO+MOD recently obtained by Heimberg, Kuske, and Schweikardt (LICS 2016).","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"28 1","pages":"1 - 32"},"PeriodicalIF":0.0,"publicationDate":"2017-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75159349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms—algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas. These algorithms can be seen, either directly or indirectly, as building Decision-Decomposable Negation Normal Form (decision-DNNF) representations of the input Boolean formulas. Decision-DNNFs are a special case of d-DNNFs where d stands for deterministic. We show that any knowledge compilation representations from a class (called DLDDs in this article) that contain decision-DNNFs can be converted into equivalent Free Binary Decision Diagrams (FBDDs), also known as Read-Once Branching Programs, with only a quasi-polynomial increase in representation size. Leveraging known exponential lower bounds for FBDDs, we then obtain similar exponential lower bounds for decision-DNNFs, which imply exponential lower bounds for model-counting algorithms. We also separate the power of decision-DNNFs from d-DNNFs and a generalization of decision-DNNFs known as AND-FBDDs. We then prove new lower bounds for FBDDs that yield exponential lower bounds on the running time of these exact model counters when applied to the problem of query evaluation in tuple-independent probabilistic databases—computing the probability of an answer to a query given independent probabilities of the individual tuples in a database instance. This approach to the query evaluation problem, in which one first obtains the lineage for the query and database instance as a Boolean formula and then performs weighted model counting on the lineage, is known as grounded inference. A second approach, known as lifted inference or extensional query evaluation, exploits the high-level structure of the query as a first-order formula. Although it has been widely believed that lifted inference is strictly more powerful than grounded inference on the lineage alone, no formal separation has previously been shown for query evaluation. In this article, we show such a formal separation for the first time. In particular, we exhibit a family of database queries for which polynomial-time extensional query evaluation techniques were previously known but for which query evaluation via grounded inference using the state-of-the-art exact model counters requires exponential time.
{"title":"Exact Model Counting of Query Expressions","authors":"P. Beame, Jerry Li, Sudeepa Roy, Dan Suciu","doi":"10.1145/2984632","DOIUrl":"https://doi.org/10.1145/2984632","url":null,"abstract":"We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms—algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas. These algorithms can be seen, either directly or indirectly, as building Decision-Decomposable Negation Normal Form (decision-DNNF) representations of the input Boolean formulas. Decision-DNNFs are a special case of d-DNNFs where d stands for deterministic. We show that any knowledge compilation representations from a class (called DLDDs in this article) that contain decision-DNNFs can be converted into equivalent Free Binary Decision Diagrams (FBDDs), also known as Read-Once Branching Programs, with only a quasi-polynomial increase in representation size. Leveraging known exponential lower bounds for FBDDs, we then obtain similar exponential lower bounds for decision-DNNFs, which imply exponential lower bounds for model-counting algorithms. We also separate the power of decision-DNNFs from d-DNNFs and a generalization of decision-DNNFs known as AND-FBDDs. We then prove new lower bounds for FBDDs that yield exponential lower bounds on the running time of these exact model counters when applied to the problem of query evaluation in tuple-independent probabilistic databases—computing the probability of an answer to a query given independent probabilities of the individual tuples in a database instance. This approach to the query evaluation problem, in which one first obtains the lineage for the query and database instance as a Boolean formula and then performs weighted model counting on the lineage, is known as grounded inference. A second approach, known as lifted inference or extensional query evaluation, exploits the high-level structure of the query as a first-order formula. Although it has been widely believed that lifted inference is strictly more powerful than grounded inference on the lineage alone, no formal separation has previously been shown for query evaluation. In this article, we show such a formal separation for the first time. In particular, we exhibit a family of database queries for which polynomial-time extensional query evaluation techniques were previously known but for which query evaluation via grounded inference using the state-of-the-art exact model counters requires exponential time.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"60 1","pages":"1 - 46"},"PeriodicalIF":0.0,"publicationDate":"2017-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89129547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A recent article [Vincent et al. 2015] concerns the correctness of several results in reasoning about differential dependencies (dds), originally reported in Song and Chen [2011]. The major concern raised by Vincent et al. [2015] stems from assuming a type of infeasible differential function in the given dds for consistency and implication analysis, a type that is not allowed in Song and Chen [2011]. A differential function is said to be infeasible if there is no tuple pair with values that can satisfy the specified distance constraints. For example, [price(< 2, > 4)] requires the difference of two price values to be < 2 and > 4 at the same time, which is clearly impossible. Although dds involving infeasible differential functions may be syntactically interesting, they are semantically meaningless and would neither be specified by domain experts nor discovered from data. For these reasons, infeasible differential functions are not considered in Song and Chen [2011], and the results in Song and Chen [2011] are correct, in contrast to what is claimed in Vincent et al. [2015].
{"title":"Response to “Differential Dependencies Revisited”","authors":"Shaoxu Song, Lei Chen","doi":"10.1145/2983602","DOIUrl":"https://doi.org/10.1145/2983602","url":null,"abstract":"A recent article [Vincent et al. 2015] concerns the correctness of several results in reasoning about differential dependencies (dds), originally reported in Song and Chen [2011]. The major concern by Vincent et al. [2015] roots from assuming a type of infeasible differential functions in the given dds for consistency and implication analysis, which are not allowed in Song and Chen [2011]. A differential function is said to be infeasible if there is no tuple pair with values that can satisfy the specified distance constraints. For example, [price(<2, > 4)] requires the difference of two price values to be < 2 and > 4 at the same time, which is clearly impossible. Although dds involving infeasible differential functions may be syntactically interesting, they are semantically meaningless and would neither be specified by domain experts nor discovered from data. For these reasons, infeasible differential functions are not considered [Song and Chen 2011] and the results in Song and Chen [2011] are correct, in contrast to what is claimed in Vincent et al. [2015].","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"14 1 1","pages":"1 - 3"},"PeriodicalIF":0.0,"publicationDate":"2017-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83413036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Our media is saturated with claims of “facts” made from data. Database research has in the past focused on how to answer queries, but has not devoted much attention to discerning more subtle qualities of the resulting claims, for example, is a claim “cherry-picking”? This article proposes a framework that models claims based on structured data as parameterized queries. Intuitively, with its choice of the parameter setting, a claim presents a particular (and potentially biased) view of the underlying data. A key insight is that we can learn a lot about a claim by “perturbing” its parameters and seeing how its conclusion changes. For example, a claim is not robust if small perturbations to its parameters can change its conclusions significantly. This framework allows us to formulate practical fact-checking tasks—reverse-engineering vague claims, and countering questionable claims—as computational problems. Along with the modeling framework, we develop an algorithmic framework that enables efficient instantiations of “meta” algorithms by supplying appropriate algorithmic building blocks. We present real-world examples and experiments that demonstrate the power of our model, efficiency of our algorithms, and usefulness of their results.
{"title":"Computational Fact Checking through Query Perturbations","authors":"You Wu, P. Agarwal, Chengkai Li, Jun Yang, Cong Yu","doi":"10.1145/2996453","DOIUrl":"https://doi.org/10.1145/2996453","url":null,"abstract":"Our media is saturated with claims of “facts” made from data. Database research has in the past focused on how to answer queries, but has not devoted much attention to discerning more subtle qualities of the resulting claims, for example, is a claim “cherry-picking”? This article proposes a framework that models claims based on structured data as parameterized queries. Intuitively, with its choice of the parameter setting, a claim presents a particular (and potentially biased) view of the underlying data. A key insight is that we can learn a lot about a claim by “perturbing” its parameters and seeing how its conclusion changes. For example, a claim is not robust if small perturbations to its parameters can change its conclusions significantly. This framework allows us to formulate practical fact-checking tasks—reverse-engineering vague claims, and countering questionable claims—as computational problems. Along with the modeling framework, we develop an algorithmic framework that enables efficient instantiations of “meta” algorithms by supplying appropriate algorithmic building blocks. We present real-world examples and experiments that demonstrate the power of our model, efficiency of our algorithms, and usefulness of their results.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"108 1","pages":"1 - 41"},"PeriodicalIF":0.0,"publicationDate":"2017-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88053462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}