ACM Transactions on Database Systems最新文献_第4页

Efficiently Cleaning Structured Event Logs: A Graph Repair Approach 有效地清理结构化事件日志:一种图修复方法

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-11-17 DOI: 10.1145/3571281

Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei

Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task passing relationships between staffs in workflow or the invocation relationships among different micro-services in monitoring application performance. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while an unsound structure is not repaired automatically (which requires manual effort from business actors to handle the structure error), it is highly desirable to repair the inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures, and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.

由于各种记录约定或简单的系统错误，事件数据通常是脏的。这些错误可能会对实际应用程序造成严重损害，例如不准确的来源答案、糟糕的分析结果，或者从事件数据中隐藏有趣的模式。强烈要求清理脏事件数据。虽然现有的事件数据清理技术将事件日志视为序列，但事件之间确实存在结构化信息，例如工作流中人员之间的任务传递关系，或者监视应用程序性能时不同微服务之间的调用关系。我们认为这种结构信息不仅提高了修复不一致事件的准确性，而且提高了计算效率。值得注意的是，事件的结构和名称(标记)可能不一致。在实际的应用程序中，虽然不会自动修复不健全的结构(这需要业务参与者手动处理结构错误)，但非常希望修复由记录错误引入的不一致的事件名称。在本文中，我们首先证明了不一致标签修复问题是np完全的。然后，我们提出了一种图修复方法，用于(1)检测不健全的结构，(2)修复不一致的事件名称。提出了有效的剪枝技术和两种启发式解决方案。在真实和合成数据集上进行的大量实验证明了我们的建议的有效性和效率。

{"title":"Efficiently Cleaning Structured Event Logs: A Graph Repair Approach","authors":"Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei","doi":"10.1145/3571281","DOIUrl":"https://doi.org/10.1145/3571281","url":null,"abstract":"Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information does exist among events, such as the task passing relationships between staffs in workflow or the invocation relationships among different micro-services in monitoring application performance. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while an unsound structure is not repaired automatically (which requires manual effort from business actors to handle the structure error), it is highly desirable to repair the inconsistent event names introduced by recording mistakes. In this article, we first prove that the inconsistent label repairing problem is NP-complete. Then, we propose a graph repair approach for (1) detecting unsound structures, and (2) repairing inconsistent event names. Efficient pruning techniques together with two heuristic solutions are also presented. Extensive experiments over real and synthetic datasets demonstrate both the effectiveness and efficiency of our proposal.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"31 1","pages":"1 - 44"},"PeriodicalIF":1.8,"publicationDate":"2022-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90407660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Conjunctive Queries: Unique Characterizations and Exact Learnability 连接查询:独特的表征和精确的可学习性

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-11-06 DOI: https://dl.acm.org/doi/10.1145/3559756

Balder Ten Cate, Victor Dalmau

We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.

我们回答了哪些连接查询具有多项式多个正例和负例的唯一特征以及如何有效地构造这样的例子的问题。因此，我们得到了一种新的高效的精确学习算法。我们贡献的核心是两个新的多项式时间算法，用于在有限结构的同态格中构造边界。我们还讨论了模式映射和描述逻辑概念的独特特征和可学习性的含义。

引用次数: 0

Deciding Robustness for Lower SQL Isolation Levels 决定较低SQL隔离级别的健壮性

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-11-06 DOI: https://dl.acm.org/doi/10.1145/3561049

Bas Ketsman, Christoph Koch, Frank Neven, Brecht Vandevoort

While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this article, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this article is that we characterize robustness against both isolation levels in terms of the absence of counter-example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete’s work, is that the properties of cycles obtained in this article have to take the relative ordering of operations within transactions into account as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this article is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.

虽然可序列化性总是保证应用程序的正确性，但可以选择较低的隔离级别来提高事务吞吐量，但要冒引入某些异常的风险。如果在指定隔离级别下的事务的每个可能的交错都是可序列化的，那么一组事务对于给定的隔离级别是健壮的。因此，鲁棒性总是保证应用程序的正确性，并具有较低隔离级别的性能优势。虽然鲁棒性问题在文献中得到了相当大的关注，但只有充分的条件才得到。最值得注意的例外是Fekete的开创性工作，他在那里获得了决定对SNAPSHOT隔离的鲁棒性的表征。在本文中，我们将解决较低SQL隔离级别READ UNCOMMITTED和READ COMMITTED的健壮性问题，这两个级别是根据禁止的脏写和脏读模式定义的。本文的第一个主要贡献是，我们根据不存在特定形式的反例调度(分裂和多分裂调度)和不存在满足各种性质的干涉图中的循环来描述对两种隔离级别的鲁棒性。与Fekete的工作的一个关键区别是，本文中获得的循环属性必须考虑事务中操作的相对顺序，因为READ UNCOMMITTED和READ COMMITTED不能满足原子可见性要求。一个特别的结果是后者呈现了针对READ COMMITTED cop -complete的鲁棒性问题。本文的第二个主要贡献是conp硬度证明。对于READ UNCOMMITTED，我们获得了日志空间完整性。

{"title":"Deciding Robustness for Lower SQL Isolation Levels","authors":"Bas Ketsman, Christoph Koch, Frank Neven, Brecht Vandevoort","doi":"https://dl.acm.org/doi/10.1145/3561049","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3561049","url":null,"abstract":"While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this article, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this article is that we characterize robustness against both isolation levels in terms of the absence of counter-example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete’s work, is that the properties of cycles obtained in this article have to take the relative ordering of operations within transactions into account as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this article is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"20 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2022-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Robust and Efficient Sorting with Offset-value Coding 基于偏移值编码的鲁棒高效排序

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-09-17 DOI: 10.1145/3570956

Thanh Do, G. Graefe

Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup, and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding. In the process, we happened upon its mutually beneficial relationship with prefix truncation in run files as well as the duality of compression techniques in row- and column-format storage structures, namely prefix truncation and run-length encoding of leading key columns. We also found a beneficial relationship with consumers of sorted streams, e.g., merging parallel streams, in-stream aggregation, and merge join. We report on our implementation in the context of Google’s Napa and F1 Query systems as well as an experimental evaluation of performance and scalability.

排序和搜索是数据库查询处理的重要组成部分，例如，以索引创建、索引维护和索引查找的形式，而比较键对是排序和搜索工作的重要部分。我们研究了几十年前被忽视的快速比较和快速排序的有效技术的简单高效实现，特别是偏移值编码。在此过程中，我们发现了它与运行文件中的前缀截断以及行和列格式存储结构中压缩技术的双重性，即前缀截断和前导键列的游程编码之间的互利关系。我们还发现了与排序流的消费者之间的有益关系，例如，合并并行流、流内聚合和合并联接。我们报告了我们在谷歌Napa和F1查询系统中的实现，以及对性能和可扩展性的实验评估。

引用次数: 1

Proximity Queries on Terrain Surface 地形表面邻近查询

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-09-16 DOI: 10.1145/3563773

Victor Junqiu Wei, R. C. Wong, Cheng Long, D. Mount, H. Samet

Due to the advance of the geo-spatial positioning and the computer graphics technology, digital terrain data has become increasingly popular nowadays. Query processing on terrain data has attracted considerable attention from both the academic and the industry communities. Proximity queries such as the shortest path/distance query, k nearest/farthest neighbor query, and top-k closest/farthest pairs query are fundamental and important queries in the context of the terrain surfaces, and they have a lot of applications in Geographical Information System, 3D object feature vector construction, and 3D object data mining. In this article, we first study the most fundamental type of query, namely, shortest distance and path query, which is to find the shortest distance and path between two points of interest on the surface of the terrain. As observed by existing studies, computing the exact shortest distance/path is very expensive. Some existing studies proposed ϵ-approximate distance and path oracles, where ϵ is a non-negative real-valued error parameter. However, the best-known algorithm has a large oracle construction time, a large oracle size, and a large query time. Motivated by this, we propose a novel ϵ-approximate distance and path oracle called the Space Efficient distance and path oracle (SE), which has a small oracle construction time, a small oracle size, and a small distance and path query time, thanks to its compactness of storing concise information about pairwise distances between any two points-of-interest. Then, we propose several algorithms for the k nearest/farthest neighbor and top-k closest/farthest pairs queries with the assistance of our distance and path oracle SE. Our experimental results show that the oracle construction time, the oracle size, and the distance and path query time of SE are up to two, three, and five orders of magnitude faster than the best-known algorithm, respectively. Besides, our algorithms for other proximity queries including k nearest/farthest neighbor queries and top-k closest/farthest pairs queries significantly outperform the state-of-the-art algorithms by up to two orders of magnitude.

由于地理空间定位和计算机图形学技术的进步，数字地形数据在当今日益流行。地形数据的查询处理已经引起了学术界和行业界的极大关注。最短路径/距离查询、k最近/最远邻居查询和top-k最近/最远距离对查询等邻近查询是地表环境中的基础查询和重要查询，在地理信息系统、三维对象特征向量构建和三维对象数据挖掘中有着广泛的应用。在本文中，我们首先研究了最基本的查询类型，即最短距离和路径查询，即找到地形表面上两个兴趣点之间的最短距离。正如现有研究所观察到的，计算精确的最短距离/路径是非常昂贵的。现有的一些研究提出了距离和路径的近似预言，其中，距离和路径是一个非负的实值误差参数。然而，最著名的算法具有较大的oracle构建时间、较大的oracle大小和较大的查询时间。受此启发，我们提出了一种新的距离和路径近似预言器，称为空间有效距离和路径预言器（SE），由于其存储任何两个兴趣点之间的成对距离的简明信息的紧凑性，它具有小的预言器构建时间、小的预言机大小以及小的距离和通路查询时间。然后，在距离和路径oracleSE的帮助下，我们提出了几种针对k个最近/最远邻居和top-k个最近-最远对查询的算法。我们的实验结果表明，SE的oracle构建时间、oracle大小以及距离和路径查询时间分别比最著名的算法快两个、三个和五个数量级。此外，我们针对其他邻近查询的算法，包括k个最近/最远邻居查询和前k个最接近/最远对查询，显著优于最先进的算法两个数量级。

{"title":"Proximity Queries on Terrain Surface","authors":"Victor Junqiu Wei, R. C. Wong, Cheng Long, D. Mount, H. Samet","doi":"10.1145/3563773","DOIUrl":"https://doi.org/10.1145/3563773","url":null,"abstract":"Due to the advance of the geo-spatial positioning and the computer graphics technology, digital terrain data has become increasingly popular nowadays. Query processing on terrain data has attracted considerable attention from both the academic and the industry communities. Proximity queries such as the shortest path/distance query, k nearest/farthest neighbor query, and top-k closest/farthest pairs query are fundamental and important queries in the context of the terrain surfaces, and they have a lot of applications in Geographical Information System, 3D object feature vector construction, and 3D object data mining. In this article, we first study the most fundamental type of query, namely, shortest distance and path query, which is to find the shortest distance and path between two points of interest on the surface of the terrain. As observed by existing studies, computing the exact shortest distance/path is very expensive. Some existing studies proposed ϵ-approximate distance and path oracles, where ϵ is a non-negative real-valued error parameter. However, the best-known algorithm has a large oracle construction time, a large oracle size, and a large query time. Motivated by this, we propose a novel ϵ-approximate distance and path oracle called the Space Efficient distance and path oracle (SE), which has a small oracle construction time, a small oracle size, and a small distance and path query time, thanks to its compactness of storing concise information about pairwise distances between any two points-of-interest. Then, we propose several algorithms for the k nearest/farthest neighbor and top-k closest/farthest pairs queries with the assistance of our distance and path oracle SE. Our experimental results show that the oracle construction time, the oracle size, and the distance and path query time of SE are up to two, three, and five orders of magnitude faster than the best-known algorithm, respectively. Besides, our algorithms for other proximity queries including k nearest/farthest neighbor queries and top-k closest/farthest pairs queries significantly outperform the state-of-the-art algorithms by up to two orders of magnitude.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"47 1","pages":"1 - 59"},"PeriodicalIF":1.8,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41730175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Deciding Robustness for Lower SQL Isolation Levels 决定较低SQL隔离级别的健壮性

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-09-03 DOI: 10.1145/3561049

Bas Ketsman, Christoph E. Koch, F. Neven, Brecht Vandevoort

While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this article, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this article is that we characterize robustness against both isolation levels in terms of the absence of counter-example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete’s work, is that the properties of cycles obtained in this article have to take the relative ordering of operations within transactions into account as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this article is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.

虽然可序列化性总是保证应用程序的正确性，但可以选择较低的隔离级别来提高事务吞吐量，但要冒引入某些异常的风险。如果在指定隔离级别下的事务的每个可能的交错都是可序列化的，那么一组事务对于给定的隔离级别是健壮的。因此，鲁棒性总是保证应用程序的正确性，并具有较低隔离级别的性能优势。虽然鲁棒性问题在文献中得到了相当大的关注，但只有充分的条件才得到。最值得注意的例外是Fekete的开创性工作，他在那里获得了决定对SNAPSHOT隔离的鲁棒性的表征。在本文中，我们将解决较低SQL隔离级别READ UNCOMMITTED和READ COMMITTED的健壮性问题，这两个级别是根据禁止的脏写和脏读模式定义的。本文的第一个主要贡献是，我们根据不存在特定形式的反例调度(分裂和多分裂调度)和不存在满足各种性质的干涉图中的循环来描述对两种隔离级别的鲁棒性。与Fekete的工作的一个关键区别是，本文中获得的循环属性必须考虑事务中操作的相对顺序，因为READ UNCOMMITTED和READ COMMITTED不能满足原子可见性要求。一个特别的结果是后者呈现了针对READ COMMITTED cop -complete的鲁棒性问题。本文的第二个主要贡献是conp硬度证明。对于READ UNCOMMITTED，我们获得了日志空间完整性。

{"title":"Deciding Robustness for Lower SQL Isolation Levels","authors":"Bas Ketsman, Christoph E. Koch, F. Neven, Brecht Vandevoort","doi":"10.1145/3561049","DOIUrl":"https://doi.org/10.1145/3561049","url":null,"abstract":"While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this article, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this article is that we characterize robustness against both isolation levels in terms of the absence of counter-example schedules of a specific form (split and multi-split schedules) and by the absence of cycles in interference graphs that satisfy various properties. A critical difference with Fekete’s work, is that the properties of cycles obtained in this article have to take the relative ordering of operations within transactions into account as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this article is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"47 1","pages":"1 - 41"},"PeriodicalIF":1.8,"publicationDate":"2022-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46309024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Persistent Summaries 持续的总结

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-08-18 DOI: https://dl.acm.org/doi/10.1145/3531053

Tianjing Zeng, Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, Ji-Rong Wen

A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.

Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which uses only sublinear space by sacrificing slightly on accuracy.

All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream, thus it is ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.

持久数据结构，在数据库文献中也称为多版本数据结构，是一种随着时间的推移而更新时保留其所有以前版本的数据结构。对数据结构的每次更新(插入、删除或更改数据记录)都会创建一个新版本，而所有版本都保留在数据结构中，以便仍然可以查询以前的任何版本。持久数据结构的目标是准确地记录所有版本，这导致空间需求至少与更新次数成线性关系。在当今的许多大数据应用中，特别是对于高速流数据，数据的数量和速度是如此之高，以至于我们无法承担存储所有数据的费用。因此，流算法以牺牲精度为代价，只占用亚线性空间，受到了研究界的广泛关注。所有流算法都是通过在内存中维护一个小的数据结构来工作的，这个数据结构通常被称为草图、摘要或概要。摘要在流中的每个元素到达时更新，因此它是短暂的，这意味着它只能回答关于流当前状态的查询。在本文中，我们的目标是设计持久的摘要，从而使流算法能够在任何先前的时间回答关于流的查询。

{"title":"Persistent Summaries","authors":"Tianjing Zeng, Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, Ji-Rong Wen","doi":"https://dl.acm.org/doi/10.1145/3531053","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3531053","url":null,"abstract":"A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear to the number of updates. In many of today’s big data applications, in particular, for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community, which uses only sublinear space by sacrificing slightly on accuracy.All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The summary is updated upon the arrival of every element in the stream, thus it is ephemeral, meaning that it can only answer queries about the current status of the stream. In this article, we aim at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 4","pages":""},"PeriodicalIF":1.8,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138508998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

On Finding Rank Regret Representatives 关于寻找等级遗憾代表

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-08-18 DOI: https://dl.acm.org/doi/10.1145/3531054

Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao

Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function.

However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in two-dimensional space but is NP-hard on three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.

在数据集中选择最佳项是数据探索中的常见任务。然而，“最佳”的概念存在于观察者的眼中:不同的用户可能认为不同的属性更重要，因此得出不同的排名。然而，我们可以删除“主导”项目，并创建数据的“代表性”子集，包括其中的“最佳项目”。帕累托最优代表保证包含每个可能排名的最佳项，但它可能是数据的很大一部分。如果我们放宽为每个用户提供最佳商品的要求，而不是仅仅限制用户的“后悔”，那么代表性就会小得多。现有的工作将遗憾定义为对任何选择的排名函数，通过限制对代表性而不是完整数据集的考虑而导致的分数损失。然而，分数通常不是一个有意义的数字，用户可能不理解它的绝对值。有时分数的小范围可以包含数据集的很大一部分。相比之下，用户确实理解排名排序的概念。因此，我们在定义后悔时考虑了项目在排名列表中的位置，并提出了排名-后悔代表作为包含任何可能的排名函数的top-k中至少一个的数据的最小子集。这个问题在二维空间中是多项式时间可解的，但在三维或更多维度上是np困难的。我们设计了一套算法来满足不同的目的，例如是否允许k松弛，结果大小，或两者兼有，是否已知分布，理论保证或实际效率是重要的，等等。在真实数据集上的实验表明，我们可以有效地找到具有小秩遗憾的小子集。

{"title":"On Finding Rank Regret Representatives","authors":"Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao","doi":"https://dl.acm.org/doi/10.1145/3531054","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3531054","url":null,"abstract":"Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be a large portion of data. A much smaller representative can be found if we relax the requirement of including the best item for each user and instead just limit the users’ “regret.” Existing work defines regret as the loss in score by limiting consideration to the representative instead of the full dataset, for any chosen ranking function.However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we consider items’ positions in the ranked list in defining the regret and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is polynomial time solvable in two-dimensional space but is NP-hard on three or more dimensions. We design a suite of algorithms to fulfill different purposes, such as whether relaxation is permitted on k, the result size, or both, whether a distribution is known, whether theoretical guarantees or practical efficiency is important, and so on. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"9 1-2","pages":""},"PeriodicalIF":1.8,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Influence Maximization Revisited: Efficient Sampling with Bound Tightened 影响最大化再论:边界收紧的有效采样

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-08-18 DOI: https://dl.acm.org/doi/10.1145/3533817

Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang

Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m)log n/ε²) expected time while returning a (1 - 1/e - ε) approximate solution with at least 1 - 1/n probability. A key phase of these IM algorithms is the random reverse reachable (RR) set generation, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms.

In this article, we present a study on this key phase and propose an efficient random RR set generation algorithm under IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under IC model can be improved to O(k ċ n log n ċ²), when for any node v, the total weight of its incoming edges is no larger than a constant. For the general IC model where the weights are skewed, we present a sampling algorithm SKIP. To the best of our knowledge, it is the first index-free algorithm that achieves the optimal time complexity of the sorted subset sampling problem.

Moreover, existing approximate IM algorithms suffer from scalability issues in high influence networks where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than states of the art as shown in our experiment.

Besides, we investigate the issues of forward propagation and derive its time complexity with our proposed subset sampling techniques. We also present a heuristic condition to indicate when the forward propagation approach should be utilized to estimate the expected influence of a given seed set.

给定一个有n个节点和m条边的社交网络G，一个正整数k和一个级联模型C，影响最大化(IM)问题要求在G中有k个节点，使得在级联模型C下受k个节点影响的节点的期望数量最大化。最先进的近似解在O(k(n+m)log n/ε2)预期时间内运行，同时返回(1 - 1/e - ε)近似解，概率至少为1 - 1/n。随机反向可达集的生成是IM算法的一个关键阶段，这一阶段对当前IM算法的效率和可扩展性有重要影响。本文对这一关键阶段进行了研究，提出了一种高效的IC模型下随机RR集生成算法。利用新算法，我们证明了现有IM算法在IC模型下的期望运行时间可以提高到O(k * * n log n ċ2)，对于任何节点v，其传入边的总权重不大于一个常数。对于权重偏斜的一般集成电路模型，我们提出了一种SKIP采样算法。据我们所知，它是第一个实现排序子集采样问题最优时间复杂度的无索引算法。此外，现有的近似IM算法在高影响力网络中存在可扩展性问题，其中随机RR集的大小通常相当大。我们通过在不牺牲近似保证的情况下减少随机RR集的平均大小来解决这个具有挑战性的问题。正如我们的实验所示，所提出的解决方案比目前的技术状态快了几个数量级。此外，我们还研究了前向传播问题，并利用我们提出的子集采样技术推导了前向传播的时间复杂度。我们还提出了一个启发式条件，以指示何时应使用前向传播方法来估计给定种子集的预期影响。

{"title":"Influence Maximization Revisited: Efficient Sampling with Bound Tightened","authors":"Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang","doi":"https://dl.acm.org/doi/10.1145/3533817","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3533817","url":null,"abstract":"Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m)log n/ε2) expected time while returning a (1 - 1/e - ε) approximate solution with at least 1 - 1/n probability. A key phase of these IM algorithms is the random reverse reachable (RR) set generation, and this phase significantly affects the efficiency and scalability of the state-of-the-art IM algorithms.In this article, we present a study on this key phase and propose an efficient random RR set generation algorithm under IC model. With the new algorithm, we show that the expected running time of existing IM algorithms under IC model can be improved to O(k ċ n log n ċ2), when for any node v, the total weight of its incoming edges is no larger than a constant. For the general IC model where the weights are skewed, we present a sampling algorithm SKIP. To the best of our knowledge, it is the first index-free algorithm that achieves the optimal time complexity of the sorted subset sampling problem.Moreover, existing approximate IM algorithms suffer from scalability issues in high influence networks where the size of random RR sets is usually quite large. We tackle this challenging issue by reducing the average size of random RR sets without sacrificing the approximation guarantee. The proposed solution is orders of magnitude faster than states of the art as shown in our experiment.Besides, we investigate the issues of forward propagation and derive its time complexity with our proposed subset sampling techniques. We also present a heuristic condition to indicate when the forward propagation approach should be utilized to estimate the expected influence of a given seed set.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"6 4","pages":""},"PeriodicalIF":1.8,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration 用随机访问和随机顺序枚举回答连接查询的并集

IF 1.8 2区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems

Pub Date : 2022-08-18 DOI: https://dl.acm.org/doi/10.1145/3531055

Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, Nicole Schweikardt

As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this article, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically meaningful manner. An even more demanding task is that of “random access”: polylogarithmic-time retrieval of an answer whose position is given.

We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. We identify a fragment of such UCQs where we can guarantee random access with polylogarithmic access time (and linear-time preprocessing) and a more general fragment where we can guarantee tractable random permutation. For general unions of free-connex acyclic CQs, we devise two algorithms with relaxed guarantees: one has logarithmic delay in expectation, and the other provides a permutation that is almost uniformly distributed. Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.

随着数据分析对数字系统变得越来越重要，对数据库查询进行特征化以进行更有效的评估也变得越来越重要。我们考虑了经过线性时间预处理阶段后具有多对数延迟的回答枚举的可追溯性尺度。通过在预处理阶段构造一个支持多对数延迟枚举的数据结构，可以获得这样的求值。在本文中，我们寻求一种结构来支持更苛刻的“随机排列”任务:真正随机顺序的多对数延迟枚举。如果下游应用程序假设中间结果以统计上有意义的方式代表整个结果集，则需要进行这种枚举。一个要求更高的任务是“随机存取”:用多对数时间检索给定位置的答案。证明了自由连通无环cq在枚举、随机顺序枚举和随机访问三种意义上都是可处理的;在没有自连接的情况下，从过去的结果可以得出，其他CQ对这三个CQ都是难以处理的(在一些细粒度的复杂性假设下)。然而，在cq的并集(UCQ)的情况下，这三个尺度是分开的:虽然自由连接的无环cq的并集具有可处理的枚举，但它可能(可证明地)不允许随机访问。我们确定了这样的ucq的一个片段，我们可以保证随机访问的多对数访问时间(和线性时间预处理)和一个更一般的片段，我们可以保证可处理的随机排列。对于自由连接无环cq的一般并集，我们设计了两种具有宽松保证的算法:一种算法具有对数期望延迟，另一种算法提供了几乎均匀分布的排列。最后，我们提出了一个实现和实证研究，表明我们的随机顺序枚举方法比最先进的替代方法具有相当大的实际优势。

{"title":"Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration","authors":"Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, Nicole Schweikardt","doi":"https://dl.acm.org/doi/10.1145/3531055","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3531055","url":null,"abstract":"As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports polylogarithmic-delay enumeration. In this article, we seek a structure that supports the more demanding task of a “random permutation”: polylogarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically meaningful manner. An even more demanding task is that of “random access”: polylogarithmic-time retrieval of an answer whose position is given.We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. We identify a fragment of such UCQs where we can guarantee random access with polylogarithmic access time (and linear-time preprocessing) and a more general fragment where we can guarantee tractable random permutation. For general unions of free-connex acyclic CQs, we devise two algorithms with relaxed guarantees: one has logarithmic delay in expectation, and the other provides a permutation that is almost uniformly distributed. Finally, we present an implementation and an empirical study that show a considerable practical superiority of our random-order enumeration approach over state-of-the-art alternatives.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"84 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0