Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. We refer to database operations coordinated by application code as ad hoc transactions. Until now, little has been known about them. This paper presents the first comprehensive study of ad hoc transactions. By studying 91 ad hoc transactions among 8 popular open-source web applications, we found that (i) every studied application uses ad hoc transactions (up to 16 per application), 71 of which play critical roles; (ii) compared with database transactions, concurrency control of ad hoc transactions is much more flexible; (iii) ad hoc transactions are error-prone: 53 of them have correctness issues, 33 of which have been confirmed by developers; and (iv) ad hoc transactions have the potential to improve performance in contentious workloads by utilizing application semantics such as access patterns. Based on these findings, we discuss the implications of ad hoc transactions for the database research community.
{"title":"Ad Hoc Transactions through the Looking Glass: An Empirical Study of Application-Level Transactions in Web Applications","authors":"Zhaoguo Wang, Chuzhe Tang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, Haibo Chen","doi":"10.1145/3638553","DOIUrl":"https://doi.org/10.1145/3638553","url":null,"abstract":"<p>Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. We refer to database operations coordinated by application code as <i>ad hoc transactions</i>. Until now, little is known about them. This paper presents the first comprehensive study on ad hoc transactions. By studying 91 ad hoc transactions among 8 popular open-source web applications, we found that (i) every studied application uses ad hoc transactions (up to 16 per application), 71 of which play critical roles; (ii) compared with database transactions, concurrency control of ad hoc transactions is much more flexible; (iii) ad hoc transactions are error-prone—53 of them have correctness issues, and 33 of them are confirmed by developers; and (iv) ad hoc transactions have the potential for improving performance in contentious workloads by utilizing application semantics such as access patterns. Based on these findings, we discuss the implications of ad hoc transactions to the database research community.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"10 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139031463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shangqi Lu, Wim Martens, Matthias Niewerth, Yufei Tao
Partial order multiway search (POMS) is a fundamental problem that finds applications in crowdsourcing, distributed file systems, software testing, and more. This problem involves an interaction between an algorithm 𝒜 and an oracle, conducted on a directed acyclic graph 𝒢 known to both parties. Initially, the oracle selects a vertex t in 𝒢 called the target. Subsequently, 𝒜 must identify the target vertex by probing reachability. In each probe, 𝒜 selects a set Q of vertices in 𝒢, the number of which is limited by a pre-agreed value k. The oracle then reveals, for each vertex q ∈ Q, whether q can reach the target in 𝒢. The objective of 𝒜 is to minimize the number of probes. We propose an algorithm to solve POMS in \(O(\log_{1+k} n + \frac{d}{k} \log_{1+d} n)\) probes, where n represents the number of vertices in 𝒢, and d denotes the largest out-degree of the vertices in 𝒢. The probing complexity is asymptotically optimal. Our study also explores two new POMS variants: The first one, named taciturn POMS, is similar to classical POMS but assumes a weaker oracle, and the second one, named EM POMS, is a direct extension of classical POMS to the external memory (EM) model. For both variants, we introduce algorithms whose performance matches or nearly matches the corresponding theoretical lower bounds.
{"title":"Partial Order Multiway Search","authors":"Shangqi Lu, Wim Martens, Matthias Niewerth, Yufei Tao","doi":"10.1145/3626956","DOIUrl":"https://doi.org/10.1145/3626956","url":null,"abstract":"Partial order multiway search (POMS) is a fundamental problem that finds applications in crowdsourcing, distributed file systems, software testing, and more. This problem involves an interaction between an algorithm 𝒜 and an oracle, conducted on a directed acyclic graph 𝒢 known to both parties. Initially, the oracle selects a vertex t in 𝒢 called the target . Subsequently, 𝒜 must identify the target vertex by probing reachability. In each probe , 𝒜 selects a set Q of vertices in 𝒢, the number of which is limited by a pre-agreed value k . The oracle then reveals, for each vertex q ∈ Q , whether q can reach the target in 𝒢. The objective of 𝒜 is to minimize the number of probes. We propose an algorithm to solve POMS in (O(log _{1+k} n + frac{d}{k} log _{1+d} n)) probes, where n represents the number of vertices in 𝒢, and d denotes the largest out-degree of the vertices in 𝒢. The probing complexity is asymptotically optimal. Our study also explores two new POMS variants: The first one, named taciturn POMS , is similar to classical POMS but assumes a weaker oracle, and the second one, named EM POMS , is a direct extension of classical POMS to the external memory (EM) model. For both variants, we introduce algorithms whose performance matches or nearly matches the corresponding theoretical lower bounds.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"1 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134992701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of storage tiering is becoming popular in data-intensive compute clusters due to recent advances in storage technologies. The Hadoop Distributed File System, for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, current big data platforms (such as Hadoop and Spark) do not exploit the presence of storage tiers and the opportunities they present for performance optimizations. Specifically, schedulers and prefetchers make decisions based only on data locality information and completely ignore the fact that local data are now stored on a variety of storage media with different performance characteristics. This article presents Trident, a scheduling and prefetching framework designed to make task assignment, resource scheduling, and prefetching decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and utilizes two novel pruning algorithms for bounding the size of the graph while still guaranteeing optimality. In addition, Trident extends YARN’s resource request model and proposes a new storage-tier-aware resource scheduling algorithm. Finally, Trident includes a cost-based data prefetching approach that coordinates with the schedulers to optimize prefetching operations. Trident is implemented in both Spark and Hadoop and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.
{"title":"Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems","authors":"Herodotos Herodotou, Elena Kakoulli","doi":"10.1145/3625389","DOIUrl":"https://doi.org/10.1145/3625389","url":null,"abstract":"The use of storage tiering is becoming popular in data-intensive compute clusters due to the recent advancements in storage technologies. The Hadoop Distributed File System, for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, current big data platforms (such as Hadoop and Spark) are not exploiting the presence of storage tiers and the opportunities they present for performance optimizations. Specifically, schedulers and prefetchers will make decisions only based on data locality information and completely ignore the fact that local data are now stored on a variety of storage media with different performance characteristics. This article presents Trident, a scheduling and prefetching framework that is designed to make task assignment, resource scheduling, and prefetching decisions based on both locality and storage tier information. Trident formulates task scheduling as a minimum cost maximum matching problem in a bipartite graph and utilizes two novel pruning algorithms for bounding the size of the graph, while still guaranteeing optimality. In addition, Trident extends YARN’s resource request model and proposes a new storage-tier-aware resource scheduling algorithm. Finally, Trident includes a cost-based data prefetching approach that coordinates with the schedulers for optimizing prefetching operations. Trident is implemented in both Spark and Hadoop and evaluated extensively using a realistic workload derived from Facebook traces as well as an industry-validated benchmark, demonstrating significant benefits in terms of application performance and cluster efficiency.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"3 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134992871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.
{"title":"DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation","authors":"Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald","doi":"10.1145/3612919","DOIUrl":"https://doi.org/10.1145/3612919","url":null,"abstract":"Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":" ","pages":"1 - 40"},"PeriodicalIF":1.8,"publicationDate":"2023-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46713250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-06-26. DOI: https://dl.acm.org/doi/10.1145/3597021
Yaxing Chen, Qinghua Zheng, Zheng Yan
Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume) based side-channel attacks. Although differentially private padding has been adopted as a principled defense against these attacks, it introduces a challenging bi-objective parametric query optimization (BPQO) problem, and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that trades off query performance against privacy loss; existing solutions suffer from poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: divide-and-conquer, which handles parameters separately according to their types so as to reduce the dimensionality of the optimization, and on-demand optimization, which progressively builds a set of necessary Pareto-optimal plans instead of seeking a complete set, thereby saving resources. Besides, we introduce an acceleration mechanism in TPOA that improves its efficiency by pruning non-optimal candidate plans in advance. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. A comprehensive evaluation against baseline algorithms on synthetic and test-bed benchmarks shows that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10–200×.
{"title":"Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding","authors":"Yaxing Chen, Qinghua Zheng, Zheng Yan","doi":"https://dl.acm.org/doi/10.1145/3597021","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3597021","url":null,"abstract":"<p>Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume)-based side-channel attacks. Albeit differentially private padding has been exploited to defend against these attacks as a principle method, it introduces a challenging bi-objective parametric query optimization (BPQO) problem and current solutions are still not satisfactory. Concretely, the goal in BPQO is to find a Pareto-optimal plan that makes a tradeoff between query performance and privacy loss; existing solutions are subjected to poor computational efficiency and high cloud resource waste. In this article, we propose a two-phase optimization algorithm called TPOA to solve the BPQO problem. TPOA incorporates two novel ideas: <i>divide-and-conquer</i> to separately handle parameters according to their types in optimization for dimensionality reduction; <i>on-demand-optimization</i> to progressively build a set of necessary Pareto-optimal plans instead of seeking a complete set for saving resources. Besides, we introduce an acceleration mechanism in TPOA to improve its efficiency, which prunes the non-optimal candidate plans in advance. We theoretically prove the correctness of TPOA, numerically analyze its complexity, and formally give an end-to-end privacy analysis. Through a comprehensive evaluation on its efficiency by running baseline algorithms over synthetic and test-bed benchmarks, we can conclude that TPOA outperforms all benchmarked methods with an overall efficiency improvement of roughly two orders of magnitude; moreover, the acceleration mechanism speeds up TPOA by 10-200×.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"16 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel
Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and computation of zeroth frequency moments (F0) for data streams. Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translating algorithms developed for F0 estimation to model counting, resulting in new algorithms for model counting. We also provide a recipe for transforming sampling algorithms over streams into constraint sampling algorithms. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming through the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields an algorithm for multidimensional range-efficient F0 estimation with a simpler analysis.
{"title":"Model Counting Meets F0 Estimation","authors":"A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel","doi":"10.1145/3603496","DOIUrl":"https://doi.org/10.1145/3603496","url":null,"abstract":"Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSP’s and computation of zeroth frequency moments (F0) for data streams. Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F0 computation. We design a recipe for translating algorithms developed for F0 estimation to model counting, resulting in new algorithms for model counting. We also provide a recipe for transforming sampling algorithm over streams to constraint sampling algorithms. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing F0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields an algorithm for multidimensional range efficient F0 estimation with a simpler analysis.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"48 1","pages":"1 - 28"},"PeriodicalIF":1.8,"publicationDate":"2023-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49071552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written, leading to undesired latency spikes and increasing the overall operational cost of a database. Efficient and persistent deletion is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on data deletion, and (iii) en masse cloud deployment of data systems. Further, we document that LSM-based key-value engines perform suboptimally in the presence of deletes in a workload. Tombstone-driven logical deletes, by design, are unable to purge the deleted entries in a timely manner, and retaining the invalidated entries perpetually affects the overall performance of LSM-engines in terms of space amplification, write amplification, and read performance. Moreover, the potentially unbounded latency for persistent deletes raises critical privacy concerns in light of the data privacy protection regulations, such as the right to be forgotten in EU’s GDPR, the right to delete in California’s CCPA and CPRA, and the deletion right in Virginia’s VCDPA. To this end, we introduce the delete design space for LSM-trees and highlight the performance implications of the different classes of delete operations. To address these challenges, in this article, we build a new key-value storage engine, Lethe+, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe+ supports any user-defined threshold for the delete persistence latency, offering higher read throughput (1.17×–1.4×) and lower space amplification (2.1×–9.8×), with a modest increase in write amplification (between 4% and 25%) that can be further amortized to less than 1%. In addition, Lethe+ supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full tree merge.
{"title":"Enabling Timely and Persistent Deletion in LSM-Engines","authors":"Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis","doi":"10.1145/3599724","DOIUrl":"https://doi.org/10.1145/3599724","url":null,"abstract":"Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written, leading to undesired latency spikes and increasing the overall operational cost of a database. Efficient and persistent deletion is key to support: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on data deletion, and (iii) en masse cloud deployment of data systems. Further, we document that LSM-based key-value engines perform suboptimally in the presence of deletes in a workload. Tombstone-driven logical deletes, by design, are unable to purge the deleted entries in a timely manner, and retaining the invalidated entries perpetually affects the overall performance of LSM-engines in terms of space amplification, write amplification, and read performance. Moreover, the potentially unbounded latency for persistent deletes brings in critical privacy concerns in light of the data privacy protection regulations, such as the right to be forgotten in EU’s GDPR, the right to delete in California’s CCPA and CPRA, and deletion right in Virginia’s VCDPA. Toward this, we introduce the delete design space for LSM-trees and highlight the performance implications of the different classes of delete operations. To address these challenges, in this article, we build a new key-value storage engine, Lethe+, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe+ supports any user-defined threshold for the delete persistence latency offering higher read throughput (1.17× -1.4×) and lower space amplification (2.1× -9.8×), with a modest increase in write amplification (between 4% and 25%) that can be further amortized to less than 1%. In addition, Lethe+ supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full tree merge.","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":" ","pages":"1 - 40"},"PeriodicalIF":1.8,"publicationDate":"2023-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49594609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the era of big data, data sharing not only boosts the world economy but also raises problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data before they are shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the original data providers. Therefore, there is an urgent need for copyright tracking. Few existing works satisfy the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking without undermining the utility of the data. In this article, we propose OPEW, a reversible database watermarking scheme based on order-preserving encryption. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate a ciphertext with redundant space. Then, we leverage the redundant space to embed a robust reversible watermark. We adopt grouping and K-means to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW retains 100% data utility, and that its robustness and efficiency are better than those of existing works.
{"title":"Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing","authors":"Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng","doi":"https://dl.acm.org/doi/10.1145/3589761","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3589761","url":null,"abstract":"<p>In the era of big data, data sharing not only boosts the economy of the world but also brings about problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data prior to them being shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the original data providers. Therefore, there is an urgent need for copyright tracking. There are few works satisfying the requirements of both privacy protection and copyright tracking. The main challenge is how to protect the shared data and realize copyright tracking while not undermining the utility of the data. In this article, we propose a novel solution of a reversible database watermarking scheme based on order-preserving encryption. First, we encrypt the data using order-preserving encryption and adjust an encryption parameter within an appropriate interval to generate a ciphertext with redundant space. Then, we leverage the redundant space to embed robust reversible watermarking. We adopt grouping and K-means to improve the embedding capacity and the robustness of the watermark. Formal theoretical analysis proves that the proposed scheme guarantees correctness and security. Results of extensive experiments show that OPEW has 100% data utility, and the robustness and efficiency of OPEW are better than existing works.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"3 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2023-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138530904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}