
Latest publications in ACM Transactions on Database Systems (TODS)

Embedded Functional Dependencies and Data-completeness Tailored Database Design
Pub Date : 2019-07-01 DOI: 10.14778/3342263.3342626
Ziheng Wei, S. Link
We establish a principled schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and integrity requirements on application data, and capable of capturing redundant data value occurrences that may cause problems with processing data that meets the requirements. We establish axiomatic, algorithmic, and logical foundations for reasoning about embedded functional dependencies. These foundations enable us to introduce generalizations of Boyce-Codd and Third normal forms that avoid processing difficulties of any application data, or minimize these difficulties across dependency-preserving decompositions, respectively. We show how to transform any given schema into application schemata that meet given completeness and integrity requirements, and the conditions of the generalized normal forms. Data over those application schemata are therefore fit for purpose by design. Extensive experiments with benchmark schemata and data illustrate the effectiveness of our framework for the acquisition of the constraints, the schema design process, and the performance of the schema designs in terms of updates and join queries.
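A minimal sketch of the central notion, under one plausible reading of the abstract: an embedded functional dependency is taken here as a pair (E, X → Y) that must hold as an ordinary FD on the tuples that are complete (non-missing) on E. The relation, the attribute names, and the helper holds_efd are invented for illustration and are not taken from the article.

```python
# Minimal sketch of checking an embedded functional dependency (E, X -> Y):
# the FD X -> Y is evaluated only on tuples with no missing values on E.
# This is one plausible reading of the abstract, not the paper's formal
# definition; relation and attribute names are made up for illustration.

def holds_efd(rows, E, X, Y):
    """Return True if X -> Y holds on the tuples complete on all attributes in E."""
    complete = [r for r in rows if all(r.get(a) is not None for a in E)]
    witness = {}
    for r in complete:
        key = tuple(r[a] for a in X)
        val = tuple(r[a] for a in Y)
        if witness.setdefault(key, val) != val:
            return False  # two E-complete tuples agree on X but differ on Y
    return True

employees = [
    {"emp": "e1", "dept": "db", "room": "217"},
    {"emp": "e2", "dept": "db", "room": "217"},
    {"emp": "e3", "dept": None, "room": "300"},  # incomplete on dept, ignored
]

print(holds_efd(employees, E={"dept", "room"}, X=["dept"], Y=["room"]))  # True
```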
Citations: 31
Bag Query Containment and Information Theory
Pub Date : 2019-06-24 DOI: 10.1145/3472391
Mahmoud Abo Khamis, Phokion G. Kolaitis, H. Ngo, Dan Suciu
The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and the conjunctive query containment under bag semantics. These connections are established using information inequalities, which are considered to be the laws of information theory. Our first main result asserts that deciding the validity of a generalization of information inequalities is many-one equivalent to the restricted case of conjunctive query containment in which the containing query is acyclic; thus, either both these problems are decidable or both are undecidable. Our second main result identifies a new decidable case of the conjunctive query containment problem under bag semantics. Specifically, we give an exponential-time algorithm for conjunctive query containment under bag semantics, provided the containing query is chordal and admits a simple junction tree.
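For readers new to bag semantics, the toy sketch below spells out what containment asks: on every database, each answer tuple of Q1 may occur at most as often as in Q2. The two conjunctive queries and the single instance are hypothetical, and checking one instance can at best refute containment, never establish it.

```python
# Toy illustration of conjunctive query containment under bag semantics:
# Q1 is bag-contained in Q2 iff on every database each answer tuple of Q1
# occurs at most as often as in Q2. The check below inspects a single,
# hand-made instance, so it can only refute containment, never prove it.
from collections import Counter

R = [("a", "b"), ("a", "b"), ("b", "c")]  # a bag (list) of tuples for relation R

def q1(db):
    # Q1(x) :- R(x, y)            -- project the first column, keeping duplicates
    return Counter(x for (x, _) in db)

def q2(db):
    # Q2(x) :- R(x, y), R(x, z)   -- self-join on the first column
    return Counter(x for (x, _) in db for (x2, _) in db if x == x2)

a1, a2 = q1(R), q2(R)
contained_here = all(a1[t] <= a2[t] for t in a1)
print(a1, a2, contained_here)
```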
Citations: 20
From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms
Pub Date : 2019-06-17 DOI: 10.1145/3323991
Patrick Damme, A. Ungethüm, Juliana Hildebrandt, Dirk Habich, Wolfgang Lehner
Lightweight integer compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, today there is a large number of algorithms to choose from, while different algorithms are tailored to different data characteristics. However, a comparative evaluation of these algorithms with different data and hardware characteristics has never been sufficiently conducted in the literature. To close this gap, we conducted an exhaustive experimental survey by evaluating several state-of-the-art lightweight integer compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of data as well as hardware properties on the performance and the compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings leading to several new insights and to the conclusion that there is no single-best algorithm. Moreover, in this article, we also introduce and evaluate a novel cost model for the selection of a suitable lightweight integer compression algorithm for a given dataset.
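As a concrete illustration of the "cascades of basic techniques" mentioned in the abstract, the sketch below chains delta coding with a simple variable-byte form of null suppression. It is a scalar toy, not one of the vectorized implementations evaluated in the article.

```python
# Toy cascade of two lightweight techniques from the abstract:
# delta coding (store gaps between sorted integers) followed by
# variable-byte null suppression (7 data bits per byte, drop leading zeros).
# This is a scalar illustration, not the vectorized code studied in the paper.

def delta_encode(values):
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def varbyte_encode(deltas):
    out = bytearray()
    for d in deltas:
        while d >= 128:
            out.append((d & 0x7F) | 0x80)  # continuation bit set
            d >>= 7
        out.append(d)                      # last byte, high bit clear
    return bytes(out)

docids = [3, 7, 200, 205, 100000]
compressed = varbyte_encode(delta_encode(docids))
print(len(compressed), "bytes instead of", 8 * len(docids))
```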
Citations: 32
General Temporally Biased Sampling Schemes for Online Model Management
Pub Date : 2019-06-11 DOI: 10.1145/3360903
Brian Hentschel, P. Haas, Yuanyuan Tian
To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
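The sketch below illustrates only the basic idea of exponentially decaying inclusion probabilities: each item, once inserted, survives every subsequent time step with probability e^(-decay). It is a plain Bernoulli scheme, not the T-TBS or R-TBS algorithms of the article, which additionally control the sample size; the class name DecayingSample is invented.

```python
# Simple Bernoulli time-biased sample with exponential decay: an item that
# arrived t steps ago is retained with probability exp(-decay * t).
# This illustrates decaying inclusion probabilities only; it does NOT bound
# the sample size the way the article's T-TBS/R-TBS schemes do.
import math
import random

class DecayingSample:
    def __init__(self, decay, seed=42):
        self.keep = math.exp(-decay)     # per-step survival probability
        self.items = []
        self.rng = random.Random(seed)

    def advance(self):
        """One time step: each retained item survives independently."""
        self.items = [x for x in self.items if self.rng.random() < self.keep]

    def insert(self, item):
        self.items.append(item)

sample = DecayingSample(decay=0.1)
for t in range(100):
    sample.advance()
    sample.insert(("item", t))
print(len(sample.items), "items retained; recent arrivals dominate")
```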
Citations: 3
Interactive Mapping Specification with Exemplar Tuples
Pub Date : 2019-06-05 DOI: 10.1145/3321485
A. Bonifati, Ugo Comignani, E. Coquery, R. Thion
While schema mapping specification is a cumbersome task for data curation specialists, it becomes unfeasible for non-expert users, who are unacquainted with the semantics and languages of the involved transformations. In this article, we present an interactive framework for schema mapping specification suited for non-expert users. The underlying key intuition is to leverage a few exemplar tuples to infer the underlying mappings and iterate the inference process via simple user interactions under the form of Boolean queries on the validity of the initial exemplar tuples. The approaches available so far are mainly assuming pairs of complete universal data examples, which can be solely provided by data curation experts, or are limited to poorly expressive mappings. We present a quasi-lattice-based exploration of the space of all possible mappings that satisfy arbitrary user exemplar tuples. Along the exploration, we challenge the user to retain the mappings that fit the user’s requirements at best and to dynamically prune the exploration space, thus reducing the number of user interactions. We prove that after the refinement process, the obtained mappings are correct and complete. We present an extensive experimental analysis devoted to measure the feasibility of our interactive mapping strategies and the inherent quality of the obtained mappings.
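A loose, hypothetical rendering of the interaction loop: keep a set of candidate mappings, show the tuple each candidate would produce from one exemplar, and prune candidates after each Boolean answer. The candidate mappings, the ask oracle, and the data are all invented and far simpler than the tuple-generating dependencies the article actually handles.

```python
# Toy version of the interactive idea in the abstract: start from candidate
# mappings, ask Boolean questions about the tuples they would produce from an
# exemplar source tuple, and prune the candidates the user rejects.
# Candidates, data, and the 'oracle' answering the questions are all invented.

exemplar = {"name": "Ada", "city": "Paris", "country": "France"}

candidates = {
    "person(name, city)":    lambda s: (s["name"], s["city"]),
    "person(name, country)": lambda s: (s["name"], s["country"]),
    "person(city, country)": lambda s: (s["city"], s["country"]),
}

def ask(produced_tuple):
    """Stand-in for the user: accepts only tuples that start with the name."""
    return produced_tuple[0] == "Ada"

surviving = {}
for label, mapping in candidates.items():
    produced = mapping(exemplar)
    if ask(produced):            # Boolean question on the produced tuple's validity
        surviving[label] = mapping

print(sorted(surviving))         # ['person(name, city)', 'person(name, country)']
```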
Citations: 32
A Unified Framework for Frequent Sequence Mining with Subsequence Constraints
Pub Date : 2019-06-05 DOI: 10.1145/3321486
Kaustubh Beedkar, Rainer Gemulla, W. Martens
Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints—including and beyond those considered in the literature—can be unified in a single framework. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive “pattern expressions” to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to succinct finite-state transducers, which we use as computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms—although more general—are efficient and, when used for sequence mining with prior constraints studied in literature, competitive to (and in some cases superior to) state-of-the-art specialized methods.
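As a down-to-earth example of a subsequence constraint, the brute-force sketch below counts length-two subsequences whose items are separated by at most one intervening event (a gap constraint). It enumerates candidates directly rather than compiling pattern expressions into finite-state transducers as the article does, and the sequences and thresholds are made up.

```python
# Brute-force frequent sequence mining under a simple gap constraint:
# count length-2 subsequences (a, b) where b follows a with at most
# `max_gap` events in between, and report those reaching `min_support`.
# This shows one subsequence constraint only; the article's framework
# compiles far richer "pattern expressions" into finite-state transducers.
from collections import Counter

sequences = [
    ["a", "b", "c", "a", "b"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]

def gapped_pairs(seq, max_gap=1):
    found = set()
    for i, x in enumerate(seq):
        for j in range(i + 1, min(len(seq), i + 2 + max_gap)):
            found.add((x, seq[j]))
    return found  # per-sequence support counts each pattern once

support = Counter(p for seq in sequences for p in gapped_pairs(seq))
min_support = 2
print([p for p, c in support.items() if c >= min_support])
```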
Citations: 6
Verification of Hierarchical Artifact Systems
Pub Date : 2019-06-05 DOI: 10.1145/3321487
Alin Deutsch, Yuliang Li, V. Vianu
Data-driven workflows, of which IBM’s Business Artifacts are a prime exponent, have been successfully deployed in practice, adopted in industrial standards, and have spawned a rich body of research in academia, focused primarily on static analysis. The present work represents a significant advance on the problem of artifact verification by considering a much richer and more realistic model than in previous work, incorporating core elements of IBM’s successful Guard-Stage-Milestone model. In particular, the model features task hierarchy, concurrency, and richer artifact data. It also allows database key and foreign key dependencies, as well as arithmetic constraints. The results show decidability of verification and establish its complexity, making use of novel techniques including a hierarchy of Vector Addition Systems and a variant of quantifier elimination tailored to our context.
Citations: 8
Output-Optimal Massively Parallel Algorithms for Similarity Joins
Pub Date : 2019-04-08 DOI: 10.1145/3311967
Xiao Hu, K. Yi, Yufei Tao
Parallel join algorithms have received much attention in recent years due to the rapid development of massively parallel systems such as MapReduce and Spark. In the database theory community, most efforts have been focused on studying worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on the hard instances having very large output sizes. In the case of a two-relation join, the hard instance is just a Cartesian product, with an output size that is quadratic in the input size. In practice, however, the output size is usually much smaller. One recent parallel join algorithm by Beame et al. has achieved output-optimality (i.e., its cost is optimal in terms of both the input size and the output size), but their algorithm only works for a 2-relation equi-join and has some imperfections. In this article, we first improve their algorithm to true optimality. Then we design output-optimal algorithms for a large class of similarity joins. Finally, we present a lower bound, which essentially eliminates the possibility of having output-optimal algorithms for any join on more than two relations.
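To anchor the terminology, the sketch below runs a sequential one-dimensional similarity join whose work after sorting is roughly proportional to the input plus the number of reported pairs. It only illustrates what "output size" means for such a join and says nothing about the massively parallel algorithms or lower bounds of the article; the function name and data are made up.

```python
# Sequential 1-D similarity join: report all pairs (a, b) with |a - b| <= eps.
# After sorting, the scan does work proportional to the input plus the number
# of reported pairs, which is the "output size" the abstract refers to.
# This is a single-machine illustration, not the article's MPC algorithms.

def similarity_join(A, B, eps):
    A, B = sorted(A), sorted(B)
    out, start = [], 0
    for a in A:
        while start < len(B) and B[start] < a - eps:
            start += 1                      # B[start..] are the only candidates
        k = start
        while k < len(B) and B[k] <= a + eps:
            out.append((a, B[k]))
            k += 1
    return out

print(similarity_join([1, 5, 9], [2, 4, 10], eps=1.5))
# [(1, 2), (5, 4), (9, 10)]
```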
Citations: 19
Inferring Insertion Times and Optimizing Error Penalties in Time-decaying Bloom Filters
Pub Date : 2019-03-15 DOI: 10.1145/3284552
Jonathan L. Dautrich, C. Ravishankar
Current Bloom Filters tend to ignore Bayesian priors as well as a great deal of useful information they hold, compromising the accuracy of their responses. Incorrect responses cause users to incur penalties that are both application- and item-specific, but current Bloom Filters are typically tuned only for static penalties. Such shortcomings are problematic for all Bloom Filter variants, but especially so for Time-decaying Bloom Filters, in which the memory of older items decays over time, causing both false positives and false negatives. We address these issues by introducing inferential filters, which integrate Bayesian priors and information latent in filters to make penalty-optimal, query-specific decisions. We also show how to properly infer insertion times in such filters. Our methods are general, but here we illustrate their application to inferential time-decaying filters to support novel query types and sliding window queries with dynamic error penalties. We present inferential versions of the Timing Bloom Filter and Generalized Bloom Filter. Our experiments on real and synthetic datasets show that our methods reduce penalties for incorrect responses to sliding-window queries in these filters by up to 70% when penalties are dynamic.
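A bare-bones sketch of a time-decaying Bloom filter: each cell remembers the most recent insertion tick that touched it, and cells older than a window count as empty. It shows why both false positives (collisions) and false negatives (expiry) arise, but it implements neither the Timing/Generalized Bloom Filters nor the Bayesian inference layer proposed in the article; the class DecayingBloom and its parameters are hypothetical.

```python
# Bare-bones time-decaying Bloom filter: each cell keeps the latest tick at
# which some inserted item hashed to it, and cells older than `window` ticks
# are treated as empty. Collisions cause false positives; expiry causes false
# negatives. The article's Timing/Generalized filters and the Bayesian
# inference layer on top of them are not implemented here.
import hashlib

class DecayingBloom:
    def __init__(self, m=64, k=3, window=10):
        self.cells = [None] * m            # last insertion tick per cell
        self.m, self.k, self.window = m, k, window
        self.tick = 0

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def insert(self, item):
        self.tick += 1
        for p in self._positions(item):
            self.cells[p] = self.tick

    def query(self, item):
        return all(self.cells[p] is not None
                   and self.tick - self.cells[p] < self.window
                   for p in self._positions(item))

bf = DecayingBloom()
bf.insert("alice")
for i in range(12):                         # let time pass via other insertions
    bf.insert(f"item-{i}")
print(bf.query("alice"), bf.query("bob"))   # likely (False, False): alice expired
```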
Citations: 2
A Survey of Spatial Crowdsourcing
Pub Date : 2019-03-15 DOI: 10.1145/3291933
S. Gummidi, Xike Xie, T. Pedersen
Widespread use of advanced mobile devices has led to the emergence of a new class of crowdsourcing called spatial crowdsourcing. Spatial crowdsourcing advances the potential of a crowd to perform tasks related to real-world scenarios involving physical locations, which were not feasible with conventional crowdsourcing methods. The main feature of spatial crowdsourcing is the presence of spatial tasks that require workers to be physically present at a particular location for task fulfillment. Research related to this new paradigm has gained momentum in recent years, necessitating a comprehensive survey to offer a bird’s-eye view of the current state of spatial crowdsourcing literature. In this article, we discuss the spatial crowdsourcing infrastructure and identify the fundamental differences between spatial and conventional crowdsourcing. Furthermore, we provide a comprehensive view of the existing literature by introducing a taxonomy, elucidate the issues/challenges faced by different components of spatial crowdsourcing, and suggest potential research directions for the future.
Citations: 56