
Latest Publications in ACM Transactions on Database Systems

Incremental Graph Computations: Doable and Undoable
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2022-05-23 | DOI: https://dl.acm.org/doi/full/10.1145/3500930
Wenfei Fan, Chao Tian

The incremental problem for a class 𝒬 of graph queries aims to compute, given a query Q ∈ 𝒬, a graph G, the answers Q(G) to Q in G, and updates ΔG to G as input, the changes ΔO to the output Q(G) such that Q(G⊕ΔG) = Q(G)⊕ΔO. It is called bounded if its cost can be expressed as a polynomial function in the sizes of Q, ΔG, and ΔO, which reduces the computation on a possibly big G to the small ΔG and ΔO. However desirable, our first results are negative: for common graph queries such as traversal, connectivity, keyword search, pattern matching, and maximum cardinality matching, the incremental problems are unbounded.

In light of the negative results, we propose two characterizations for the effectiveness of incremental graph computation: (a) localizable, if its cost is determined by small neighborhoods of the nodes in ΔG rather than by the entire G; and (b) bounded relative to a batch graph algorithm 𝒯, if its cost is determined by the sizes of ΔG and of the changes to the affected area that is necessarily checked by any algorithm that incrementalizes 𝒯. We show that the incremental computations above are either localizable or relatively bounded, by providing corresponding incremental algorithms. That is, we can either reduce the incremental computations on big graphs to small data, or incrementalize existing batch graph algorithms by minimizing unnecessary recomputation. Using real-life and synthetic data, we experimentally verify the effectiveness of our incremental algorithms.
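As a rough illustration of the contract Q(G⊕ΔG) = Q(G)⊕ΔO and of localizability, the sketch below maintains single-source reachability under edge insertions, visiting only the newly affected area instead of rerunning the batch algorithm over all of G. This is an illustrative toy, not one of the paper's algorithms; all names are ours.

```python
from collections import defaultdict, deque

def bfs_reachable(adj, src):
    """Batch algorithm: full BFS from src over the whole graph G."""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def insert_edge(adj, reach, u, v):
    """Incremental step: apply a single update (insert edge u -> v) to the
    maintained answer `reach`, visiting only nodes made newly reachable."""
    adj[u].append(v)
    if u in reach and v not in reach:
        frontier = deque([v])
        reach.add(v)
        while frontier:
            x = frontier.popleft()
            for y in adj[x]:
                if y not in reach:
                    reach.add(y)
                    frontier.append(y)
    return reach
```

Recomputing `bfs_reachable` after every update costs time in the size of G; `insert_edge` costs time proportional to the affected area only, which is the point of localizability.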

Citations: 0
Embedded Functional Dependencies and Data-completeness Tailored Database Design
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2021-05-30 | DOI: 10.1145/3450518
Ziheng Wei, Sebastian Link
We establish a principled schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and integrity requirements on application data, and capable of capturing redundant data value occurrences that may cause problems with processing data that meets the requirements. We establish axiomatic, algorithmic, and logical foundations for reasoning about embedded functional dependencies. These foundations enable us to introduce generalizations of Boyce-Codd and Third normal forms that avoid processing difficulties of any application data, or minimize these difficulties across dependency-preserving decompositions, respectively. We show how to transform any given schema into application schemata that meet given completeness and integrity requirements, and the conditions of the generalized normal forms. Data over those application schemata are therefore fit for purpose by design. Extensive experiments with benchmark schemata and data illustrate the effectiveness of our framework for the acquisition of the constraints, the schema design process, and the performance of the schema designs in terms of updates and join queries.
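To make the notion concrete: an embedded functional dependency can be read as an ordinary FD X → Y whose scope is restricted to the tuples that are complete on a designated attribute set E, so its satisfaction does not depend on how missing values are interpreted. The checker below is our own reading of that definition, not code from the paper; all names are illustrative.

```python
def satisfies_efd(rows, E, X, Y, null=None):
    """Check an embedded FD (E: X -> Y): the FD X -> Y must hold on the
    subrelation of tuples that are complete (non-null) on every attribute in E."""
    assert set(X) <= set(E) and set(Y) <= set(E)
    seen = {}
    for row in rows:
        if any(row.get(a, null) == null for a in E):
            continue  # incomplete on E: outside the scope of the embedded FD
        key = tuple(row[a] for a in X)
        val = tuple(row[a] for a in Y)
        if seen.setdefault(key, val) != val:
            return False  # two complete tuples agree on X but differ on Y
    return True
```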
Citations: 0
Constant-Delay Enumeration for Nondeterministic Document Spanners
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2021-04-14 | DOI: 10.1145/3436487
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth
We consider the information extraction framework known as document spanners and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS’18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, particularly for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.
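The shape of such an enumeration algorithm, a preprocessing phase followed by streaming one variable-to-span mapping at a time, can be illustrated with Python's `re` module standing in for the compiled automaton. Note that `re` gives none of the paper's delay guarantees; this only shows the interface, and the function name is ours.

```python
import re

def spanner_enumerate(pattern, document):
    """Interface sketch: preprocess once, then stream mappings from
    capture-variable names to spans. (Python's regex engine does NOT
    provide the paper's constant-delay guarantee.)"""
    automaton = re.compile(pattern)         # "preprocessing" phase
    for m in automaton.finditer(document):  # enumeration phase
        yield {name: m.span(name) for name in automaton.groupindex}

doc = "alice@example.org bob@example.com"
pattern = r"(?P<user>\w+)@(?P<host>[\w.]+)"
mappings = list(spanner_enumerate(pattern, doc))
```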
Citations: 0
Functional Aggregate Queries with Additive Inequalities
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2020-12-06 | DOI: 10.1145/3426865
Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich
Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input fac...
Citations: 10
Efficient Sorting, Duplicate Removal, Grouping, and Aggregation
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2020-10-01 | DOI: 10.1145/3568027
Thanh Do, G. Graefe, J. Naughton
Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm to use based on several factors, including the sort order of the input, input and output sizes, and the need for sorted output. For example, hash-based aggregation is ideal for output smaller than the available memory (e.g., Query 1 of TPC-H), whereas sorting the entire input and aggregating after sorting are preferable when both aggregation input and output are large and the output needs to be sorted for a subsequent operation such as a merge join. Unfortunately, the size information required for a sound choice is often inaccurate or unavailable during query optimization, leading to sub-optimal algorithm choices. In response, this article introduces a new algorithm for sort-based duplicate removal, grouping, and aggregation. The new algorithm always performs at least as well as both traditional hash-based and traditional sort-based algorithms. It can serve as a system’s only aggregation algorithm for unsorted inputs, thus preventing erroneous algorithm choices. Furthermore, the new algorithm produces sorted output that can speed up subsequent operations. Google’s F1 Query uses the new algorithm in production workloads that aggregate petabytes of data every day.
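For background, the two classical strategies the article contrasts can be sketched in a few lines: hash aggregation over unsorted input, and in-stream aggregation over input already sorted on the grouping key, whose output comes out sorted. This is a textbook illustration, not the article's new algorithm.

```python
from itertools import groupby

def hash_aggregate(rows):
    """Hash aggregation: unsorted (key, value) input, in-memory hash table."""
    table = {}
    for key, val in rows:
        table[key] = table.get(key, 0) + val
    return table

def instream_aggregate(sorted_rows):
    """In-stream aggregation: input sorted on the key, so each group closes
    as soon as the key changes, and the output is itself sorted."""
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted_rows, key=lambda r: r[0])]
```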
Citations: 6
Conjunctive Queries: Unique Characterizations and Exact Learnability
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2020-08-16 | DOI: 10.1145/3559756
B. T. Cate, V. Dalmau
We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.
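The underlying machinery works in the homomorphism lattice of finite structures. A brute-force homomorphism test (exponential, unlike the paper's polynomial-time frontier constructions) makes the basic notion concrete; the encoding of structures as (domain, edge set) pairs is our simplification to a single binary relation.

```python
from itertools import product

def homomorphism_exists(A, B):
    """Brute-force test for a homomorphism from structure A to structure B,
    each given as (domain, set of directed edges). For tiny examples only."""
    dom_a, edges_a = A
    dom_b, edges_b = B
    for image in product(dom_b, repeat=len(dom_a)):
        h = dict(zip(dom_a, image))
        # h is a homomorphism iff it maps every edge of A onto an edge of B.
        if all((h[x], h[y]) in edges_b for (x, y) in edges_a):
            return True
    return False
```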
Citations: 20
Efficient Enumeration Algorithms for Regular Document Spanners
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2020-02-08 | DOI: 10.1145/3351451
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, Domagoj Vrgoč
Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to...
Citations: 21
Distributed Joins and Data Placement for Minimal Network Traffic
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2018-11-26 | DOI: 10.1145/3241039
Orestis Polychroniou, Wangda Zhang, K. A. Ross
Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.
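The core idea of scheduling transfers per distinct join key can be sketched as follows: for each key, ship all matching tuples to the node that already holds the most of them. This simplified single-destination variant is our illustration only; the actual track join considers richer migration plans and cost models.

```python
from collections import defaultdict

def track_join_schedule(r_parts, s_parts):
    """Per-key transfer schedule sketch. r_parts/s_parts map each node to
    a {join_key: tuple_count} dict describing its local R and S partitions.
    Returns {key: (destination_node, tuples_moved_over_network)}."""
    per_key = defaultdict(lambda: defaultdict(int))
    for parts in (r_parts, s_parts):
        for node, keys in parts.items():
            for key, count in keys.items():
                per_key[key][node] += count
    schedule = {}
    for key, counts in per_key.items():
        total = sum(counts.values())
        dest = max(counts, key=counts.get)  # node holding most tuples for key
        schedule[key] = (dest, total - counts[dest])
    return schedule
```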
Citations: 8
A Relational Framework for Classifier Engineering
IF 1.8 | CAS Tier 2, Computer Science | Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2018-11-26 | DOI: 10.1145/3268931
B. Kimelfeld, C. Ré
In the design of analytical procedures and machine learning solutions, a critical and time-consuming task is that of feature engineering, for which various recipes and tooling approaches have been developed. In this article, we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database. The goal of this framework is to open the way to research and techniques to assist developers with the task of feature engineering by utilizing the database’s modeling and understanding of data and queries and by deploying the well-studied principles of database management. As a first step, we demonstrate the usefulness of this framework by formally defining three key algorithmic challenges. The first challenge is that of separability, which is the problem of determining the existence of feature queries that agree with the training examples. The second is that of evaluating the VC dimension of the model class with respect to a given sequence of feature queries. The third challenge is identifiability, which is the task of testing for a property of independence among features that are represented as database queries. We give preliminary results on these challenges for the case where features are defined by means of conjunctive queries, and, in particular, we study the implication of various traditional syntactic restrictions on the inherent computational complexity.
Citations: 3
The five color concurrency control protocol: non-two-phase locking in general databases
IF 1.8 Tier 2 Computer Science Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date : 2018-03-02 DOI: 10.1145/78922.78927
P. Dasgupta, Z. Kedem
Concurrency control protocols based on two-phase locking are a popular family of locking protocols that preserve serializability in general (unstructured) database systems. This article presents a concurrency control algorithm (for databases with no inherent structure) that is practical, non-two-phase, and admits serializable logs that no commonly known locking scheme allows. All transactions are required to predeclare the data items they intend to read or write. Using this information, the protocol anticipates the presence (or absence) of possible conflicts and hence can permit non-two-phase locking. It is well known that serializability is characterized by acyclicity of the conflict-graph representation of interleaved executions. Two-phase locking protocols allow only forward growth of the paths in this graph. The Five Color protocol allows the conflict graph to grow in any direction (avoiding two-phase constraints) and prevents cycles in the graph by maintaining transaction access information in the form of data-item markers. The read- and write-set information can also be used to provide relative immunity from deadlocks.
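The serializability criterion the abstract appeals to, acyclicity of the conflict graph, can be checked directly on an interleaved log. The sketch below is not the Five Color protocol itself (which enforces the property online via data-item markers); it is only the offline acyclicity test, with a made-up log format of `(transaction, operation, item)` triples.

```python
# Offline conflict-serializability test via conflict-graph acyclicity,
# the property the Five Color protocol preserves at runtime.
from collections import defaultdict

def conflict_graph(log):
    """log: list of (txn, op, item) with op in {'r', 'w'}.
    Adds an edge Ti -> Tj when an operation of Ti conflicts with a
    later operation of Tj on the same item (at least one is a write)."""
    edges = defaultdict(set)
    for i, (t1, op1, x1) in enumerate(log):
        for t2, op2, x2 in log[i + 1:]:
            if t1 != t2 and x1 == x2 and 'w' in (op1, op2):
                edges[t1].add(t2)
    return edges

def is_conflict_serializable(log):
    """Serializable iff the conflict graph is acyclic (DFS cycle check)."""
    edges = conflict_graph(log)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def has_cycle(u):
        color[u] = GRAY
        for v in list(edges[u]):
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color[t] == WHITE and has_cycle(t) for t in list(edges))

# An acyclic interleaving passes; one whose conflicts form a cycle fails:
ok = [('T1', 'r', 'x'), ('T2', 'w', 'x'), ('T2', 'w', 'y'), ('T1', 'r', 'z')]
bad = [('T1', 'r', 'x'), ('T2', 'w', 'x'), ('T2', 'r', 'y'), ('T1', 'w', 'y')]
```

In `bad`, the conflicts on `x` and `y` yield edges T1→T2 and T2→T1, the cycle this test (and, online, the protocol's markers) rules out. Some logs accepted by this test are exactly the non-two-phase-lockable schedules the paper targets.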
{"title":"The five color concurrency control protocol: non-two-phase locking in general databases","authors":"P. Dasgupta, Z. Kedem","doi":"10.1145/78922.78927","DOIUrl":"https://doi.org/10.1145/78922.78927","url":null,"abstract":"Concurrency control protocols based on two-phase locking are a popular family of locking protocols that preserve serializability in general (unstructured) database systems. A concurrency control algorithm (for databases with no inherent structure) is presented that is practical, non two-phase, and allows varieties of serializable logs not possible with any commonly known locking schemes. All transactions are required to predeclare the data they intend to read or write. Using this information, the protocol anticipates the existence (or absence) of possible conflicts and hence can allow non-two-phase locking.\u0000It is well known that serializability is characterized by acyclicity of the conflict graph representation of interleaved executions. The two-phase locking protocols allow only forward growth of the paths in the graph. The Five Color protocol allows the conflict graph to grow in any direction (avoiding two-phase constraints) and prevents cycles in the graph by maintaining transaction access information in the form of data-item markers. The read and write set information can also be used to provide relative immunity from deadlocks.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"29 1","pages":"281-307"},"PeriodicalIF":1.8,"publicationDate":"2018-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76949220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Journal
ACM Transactions on Database Systems