Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. We refer to database operations coordinated by application code as ad hoc transactions. Until now, little has been known about them. This paper presents the first comprehensive study of ad hoc transactions. By studying 91 ad hoc transactions among 8 popular open-source web applications, we found that (i) every studied application uses ad hoc transactions (up to 16 per application), 71 of which play critical roles; (ii) compared with database transactions, concurrency control of ad hoc transactions is much more flexible; (iii) ad hoc transactions are error-prone: 53 of them have correctness issues, 33 of which have been confirmed by developers; and (iv) ad hoc transactions have the potential to improve performance in contentious workloads by utilizing application semantics such as access patterns. Based on these findings, we discuss the implications of ad hoc transactions for the database research community.
{"title":"Ad Hoc Transactions through the Looking Glass: An Empirical Study of Application-Level Transactions in Web Applications","authors":"Zhaoguo Wang, Chuzhe Tang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, Haibo Chen","doi":"10.1145/3638553","DOIUrl":"https://doi.org/10.1145/3638553","url":null,"abstract":"<p>Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. We refer to database operations coordinated by application code as <i>ad hoc transactions</i>. Until now, little is known about them. This paper presents the first comprehensive study on ad hoc transactions. By studying 91 ad hoc transactions among 8 popular open-source web applications, we found that (i) every studied application uses ad hoc transactions (up to 16 per application), 71 of which play critical roles; (ii) compared with database transactions, concurrency control of ad hoc transactions is much more flexible; (iii) ad hoc transactions are error-prone—53 of them have correctness issues, and 33 of them are confirmed by developers; and (iv) ad hoc transactions have the potential for improving performance in contentious workloads by utilizing application semantics such as access patterns. Based on these findings, we discuss the implications of ad hoc transactions to the database research community.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"10 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139031463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shangqi Lu, Wim Martens, Matthias Niewerth, Yufei Tao
Partial order multiway search (POMS) is a fundamental problem that finds applications in crowdsourcing, distributed file systems, software testing, and more. This problem involves an interaction between an algorithm 𝒜 and an oracle, conducted on a directed acyclic graph 𝒢 known to both parties. Initially, the oracle selects a vertex t in 𝒢 called the target. Subsequently, 𝒜 must identify the target vertex by probing reachability. In each probe, 𝒜 selects a set Q of vertices in 𝒢, the number of which is limited by a pre-agreed value k. The oracle then reveals, for each vertex q ∈ Q, whether q can reach the target in 𝒢. The objective of 𝒜 is to minimize the number of probes. We propose an algorithm to solve POMS in \(O(\log_{1+k} n + \frac{d}{k} \log_{1+d} n)\) probes, where n represents the number of vertices in 𝒢, and d denotes the largest out-degree of the vertices in 𝒢. The probing complexity is asymptotically optimal. Our study also explores two new POMS variants: The first one, named taciturn POMS, is similar to classical POMS but assumes a weaker oracle, and the second one, named EM POMS, is a direct extension of classical POMS to the external memory (EM) model. For both variants, we introduce algorithms whose performance matches or nearly matches the corresponding theoretical lower bounds.
{"title":"Partial Order Multiway Search","authors":"Shangqi Lu, Wim Martens, Matthias Niewerth, Yufei Tao","doi":"10.1145/3626956","DOIUrl":"https://doi.org/10.1145/3626956","url":null,"abstract":"Partial order multiway search (POMS) is a fundamental problem that finds applications in crowdsourcing, distributed file systems, software testing, and more. This problem involves an interaction between an algorithm 𝒜 and an oracle, conducted on a directed acyclic graph 𝒢 known to both parties. Initially, the oracle selects a vertex t in 𝒢 called the target . Subsequently, 𝒜 must identify the target vertex by probing reachability. In each probe , 𝒜 selects a set Q of vertices in 𝒢, the number of which is limited by a pre-agreed value k . The oracle then reveals, for each vertex q ∈ Q , whether q can reach the target in 𝒢. The objective of 𝒜 is to minimize the number of probes. We propose an algorithm to solve POMS in (O(log _{1+k} n + frac{d}{k} log _{1+d} n)) probes, where n represents the number of vertices in 𝒢, and d denotes the largest out-degree of the vertices in 𝒢. The probing complexity is asymptotically optimal. Our study also explores two new POMS variants: The first one, named taciturn POMS , is similar to classical POMS but assumes a weaker oracle, and the second one, named EM POMS , is a direct extension of classical POMS to the external memory (EM) model. For both variants, we introduce algorithms whose performance matches or nearly matches the corresponding theoretical lower bounds.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134992701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of storage tiering is becoming popular in data-intensive compute clusters due to recent advances in storage technologies. The Hadoop Distributed File System, for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, current big data platforms (such as Hadoop and Spark) do not exploit the presence of storage tiers and the opportunities they present for performance optimizations. Specifically, schedulers and prefetchers make decisions based only on data locality information and completely ignore the fact that local data are now stored on a variety of storage media with different performance characteristics. This article presents Trident, a scheduling and prefetching framework designed to make task assignment, resource scheduling, and prefetching decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and utilizes two novel pruning algorithms for bounding the size of the graph while still guaranteeing optimality. In addition, Trident extends YARN’s resource request model and proposes a new storage-tier-aware resource scheduling algorithm. Finally, Trident includes a cost-based data prefetching approach that coordinates with the schedulers to optimize prefetching operations. Trident is implemented in both Spark and Hadoop and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.
{"title":"Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems","authors":"Herodotos Herodotou, Elena Kakoulli","doi":"10.1145/3625389","DOIUrl":"https://doi.org/10.1145/3625389","url":null,"abstract":"The use of storage tiering is becoming popular in data-intensive compute clusters due to the recent advancements in storage technologies. The Hadoop Distributed File System, for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, current big data platforms (such as Hadoop and Spark) are not exploiting the presence of storage tiers and the opportunities they present for performance optimizations. Specifically, schedulers and prefetchers will make decisions only based on data locality information and completely ignore the fact that local data are now stored on a variety of storage media with different performance characteristics. This article presents Trident, a scheduling and prefetching framework that is designed to make task assignment, resource scheduling, and prefetching decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and utilizes two novel pruning algorithms for bounding the size of the graph, while still guaranteeing optimality. In addition, Trident extends YARN’s resource request model and proposes a new storage-tier-aware resource scheduling algorithm. Finally, Trident includes a cost-based data prefetching approach that coordinates with the schedulers for optimizing prefetching operations. Trident is implemented in both Spark and Hadoop and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"3 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134992871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.
{"title":"DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation","authors":"Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald","doi":"10.1145/3612919","DOIUrl":"https://doi.org/10.1145/3612919","url":null,"abstract":"Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":" ","pages":"1 - 40"},"PeriodicalIF":1.8,"publicationDate":"2023-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46713250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-26. DOI: https://dl.acm.org/doi/10.1145/3597021
Yaxing Chen, Qinghua Zheng, Zheng Yan
Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume) based side-channel attacks. Although differentially private padding has been adopted as a principled defense against these attacks, it introduces a challenging bi-objective parametric query optimization (BPQO) problem, and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that trades off query performance against privacy loss; existing solutions suffer from poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: divide-and-conquer, which handles parameters separately according to their types so as to reduce the dimensionality of the optimization, and on-demand optimization, which progressively builds a set of necessary Pareto-optimal plans instead of seeking a complete set, thereby saving resources. Besides, we introduce an acceleration mechanism in TPOA that improves its efficiency by pruning non-optimal candidate plans in advance. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. A comprehensive evaluation against baseline algorithms on synthetic and test-bed benchmarks shows that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10–200×.
{"title":"Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding","authors":"Yaxing Chen, Qinghua Zheng, Zheng Yan","doi":"https://dl.acm.org/doi/10.1145/3597021","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3597021","url":null,"abstract":"<p>Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume)-based side-channel attacks. Albeit differentially private padding has been exploited to defend against these attacks as a principle method, it introduces a challenging bi-objective parametric query optimization (BPQO) problem and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that makes a tradeoff between query performance and privacy loss; existing solutions are subjected to poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: <i>divide-and-conquer</i> to separately handle parameters according to their types in optimization for dimensionality reduction; <i>on-demand-optimization</i> to progressively build a set of necessary Pareto-optimal plans instead of seeking a complete set for saving resources. Besides, we introduce an acceleration mechanism in TPOA to improve its efficiency, which prunes the non-optimal candidate plans in advance. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. Through a comprehensive evaluation on its efficiency by running baseline algorithms over synthetic and test-bed benchmarks, we can conclude that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10-200×.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"16 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel
Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and computation of zeroth frequency moments (F0) for data streams. Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translating algorithms developed for F0 estimation to model counting, resulting in new algorithms for model counting. We also provide a recipe for transforming sampling algorithms over streams into constraint sampling algorithms. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming through the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields an algorithm for multidimensional range-efficient F0 estimation with a simpler analysis.
{"title":"Model Counting Meets F0 Estimation","authors":"A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel","doi":"10.1145/3603496","DOIUrl":"https://doi.org/10.1145/3603496","url":null,"abstract":"Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSP’s and computation of zeroth frequency moments (F0) for data streams. Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translating algorithms developed for F0 estimation to model counting, resulting in new algorithms for model counting. We also provide a recipe for transforming sampling algorithm over streams to constraint sampling algorithms. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields an algorithm for multidimensional range efficient F0 estimation with a simpler analysis.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"48 1","pages":"1 - 28"},"PeriodicalIF":1.8,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49071552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written, leading to undesired latency spikes and increasing the overall operational cost of a database. Efficient and persistent deletion is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on data deletion, and (iii) en masse cloud deployment of data systems. Further, we document that LSM-based key-value engines perform suboptimally in the presence of deletes in a workload. Tombstone-driven logical deletes, by design, are unable to purge the deleted entries in a timely manner, and retaining the invalidated entries perpetually affects the overall performance of LSM-engines in terms of space amplification, write amplification, and read performance. Moreover, the potentially unbounded latency for persistent deletes raises critical privacy concerns in light of the data privacy protection regulations, such as the right to be forgotten in EU’s GDPR, the right to delete in California’s CCPA and CPRA, and the deletion right in Virginia’s VCDPA. To this end, we introduce the delete design space for LSM-trees and highlight the performance implications of the different classes of delete operations. To address these challenges, in this article, we build a new key-value storage engine, Lethe+, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe+ supports any user-defined threshold for the delete persistence latency, offering higher read throughput (1.17×–1.4×) and lower space amplification (2.1×–9.8×), with a modest increase in write amplification (between 4% and 25%) that can be further amortized to less than 1%. In addition, Lethe+ supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full tree merge.
{"title":"Enabling Timely and Persistent Deletion in LSM-Engines","authors":"Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis","doi":"10.1145/3599724","DOIUrl":"https://doi.org/10.1145/3599724","url":null,"abstract":"Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written, leading to undesired latency spikes and increasing the overall operational cost of a database. Efficient and persistent deletion is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on data deletion, and (iii) en masse cloud deployment of data systems. Further, we document that LSM-based key-value engines perform suboptimally in the presence of deletes in a workload. Tombstone-driven logical deletes, by design, are unable to purge the deleted entries in a timely manner, and retaining the invalidated entries perpetually affects the overall performance of LSM-engines in terms of space amplification, write amplification, and read performance. Moreover, the potentially unbounded latency for persistent deletes brings in critical privacy concerns in light of the data privacy protection regulations, such as the right to be forgotten in EU’s GDPR, the right to delete in California’s CCPA and CPRA, and deletion right in Virginia’s VCDPA. Toward this, we introduce the delete design space for LSM-trees and highlight the performance implications of the different classes of delete operations. To address these challenges, in this article, we build a new key-value storage engine, Lethe+, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe+ supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17× -1.4×) and lower space amplification (2.1× -9.8×), with a modest increase in write amplification (between 4% and 25%) that can be further amortized to less than 1%. In addition, Lethe+ supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full tree merge.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":" ","pages":"1 - 40"},"PeriodicalIF":1.8,"publicationDate":"2023-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49594609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the era of big data, data sharing not only boosts the world economy but also raises problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data before they are shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the original data providers. Therefore, there is an urgent need for copyright tracking. Few existing works satisfy the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking without undermining the utility of the data. In this article, we propose OPEW, a reversible database watermarking scheme based on order-preserving encryption. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate a ciphertext with redundant space. Then, we leverage the redundant space to embed a robust reversible watermark. We adopt grouping and K-means to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW retains 100% data utility, and that its robustness and efficiency are better than those of existing works.
{"title":"Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing","authors":"Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng","doi":"https://dl.acm.org/doi/10.1145/3589761","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3589761","url":null,"abstract":"<p>In the era of big data, data sharing not only boosts the economy of the world but also brings about problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data prior to them being shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the original data providers. Therefore, there is an urgent need for copyright tracking. There are few works satisfying the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking while not undermining the utility of the data. In this article, we propose a novel solution of a reversible database watermarking scheme based on order-preserving encryption. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate a ciphertext with redundant space. Then, we leverage the redundant space to embed robust reversible watermarking. We adopt grouping and K-means to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW has 100% data utility, and the robustness and efficiency of OPEW are better than existing works.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"3 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}