Enabling Timely and Persistent Deletion in LSM-Engines

ACM Transactions on Database Systems (IF 2.2; CAS Zone 2, Computer Science; JCR Q3, Computer Science, Information Systems) Pub Date: 2023-06-08 DOI: https://dl.acm.org/doi/10.1145/3599724
Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis

Abstract

Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written, leading to undesired latency spikes and increasing the overall operational cost of a database. Efficient and persistent deletion is key to supporting: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on data deletion, and (iii) en masse cloud deployment of data systems.
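
The tombstone mechanism described above can be illustrated with a toy model. The following is a minimal sketch, not the implementation of any real engine: the class `ToyLSM`, its methods, and the single-step `compact_all` are hypothetical names chosen for illustration, and each "run" is simply an in-memory dictionary. It shows that a delete only hides the key logically, while the older physical copy survives until a compaction merges the tombstone with it.

```python
TOMBSTONE = object()  # sentinel value marking a logical delete (illustrative)


class ToyLSM:
    """Toy out-of-place key-value store; names and layout are assumptions."""

    def __init__(self):
        self.memtable = {}  # newest, mutable buffer
        self.runs = []      # immutable flushed runs, newest first

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        # Out-of-place delete: nothing is erased, a tombstone is inserted.
        self.memtable[key] = TOMBSTONE

    def flush(self):
        # Freeze the buffer into a new sorted run placed in front of older runs.
        if self.memtable:
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Search newest to oldest; the first match wins.
        for level in [self.memtable, *self.runs]:
            if key in level:
                value = level[key]
                return None if value is TOMBSTONE else value
        return None

    def physical_copies(self, key):
        # Number of flushed runs that still hold a real value for the key.
        return sum(
            1 for run in self.runs
            if key in run and run[key] is not TOMBSTONE
        )

    def compact_all(self):
        # Merge every flushed run into one, oldest first so newer entries win.
        merged = {}
        for run in reversed(self.runs):
            merged.update(run)
        # Tombstones may be dropped only because no older run remains below.
        live = {k: v for k, v in sorted(merged.items()) if v is not TOMBSTONE}
        self.runs = [live] if live else []
```

In this sketch, after `put("k", "v1")`, `flush()`, `delete("k")`, `flush()`, a lookup returns `None` although `physical_copies("k")` is still 1; the value is physically gone only once `compact_all()` runs. How long that takes in a real engine depends on its compaction schedule, which is the lack of guarantee the abstract points to.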

Further, we document that LSM-based key-value engines perform suboptimally in the presence of deletes in a workload. Tombstone-driven logical deletes, by design, are unable to purge the deleted entries in a timely manner, and retaining the invalidated entries perpetually affects the overall performance of LSM-engines in terms of space amplification, write amplification, and read performance. Moreover, the potentially unbounded latency for persistent deletes raises critical privacy concerns in light of data privacy protection regulations, such as the right to be forgotten in the EU’s GDPR, the right to delete in California’s CCPA and CPRA, and the deletion right in Virginia’s VCDPA. Toward this, we introduce the delete design space for LSM-trees and highlight the performance implications of the different classes of delete operations.

To address these challenges, in this article, we build a new key-value storage engine, Lethe+, that uses a very small amount of additional metadata, a set of new delete-aware compaction policies, and a new physical data layout that weaves the sort and the delete key order. We show that Lethe+ supports any user-defined threshold for the delete persistence latency, offering higher read throughput (1.17× to 1.4×) and lower space amplification (2.1× to 9.8×), with a modest increase in write amplification (between 4% and 25%) that can be further amortized to less than 1%. In addition, Lethe+ supports efficient range deletes on a secondary delete key by dropping entire data pages without sacrificing read performance or employing a costly full tree merge.
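
The abstract does not spell out the metadata or the page layout, so the following Python sketch only illustrates the two ideas under stated assumptions. It assumes (hypothetically) that each file records the timestamp of its oldest tombstone, so that a compaction policy can pick files whose tombstones have outlived a user-defined threshold, and that each page knows the minimum and maximum delete key it holds, so that a range delete on the delete key can drop fully covered pages without reading them. The names `FileMeta`, `Page`, `files_due_for_compaction`, and `range_delete_on_delete_key` are invented for this example and are not taken from Lethe+.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileMeta:
    name: str
    oldest_tombstone_ts: Optional[float]  # None if the file has no tombstones


def files_due_for_compaction(
    files: List[FileMeta], now: float, threshold: float
) -> List[FileMeta]:
    """Files whose oldest tombstone is at least `threshold` old, oldest first."""
    due = [
        f for f in files
        if f.oldest_tombstone_ts is not None
        and now - f.oldest_tombstone_ts >= threshold
    ]
    return sorted(due, key=lambda f: f.oldest_tombstone_ts)


@dataclass
class Page:
    # Each entry is (sort_key, delete_key, value); pages are assumed non-empty.
    entries: List[Tuple[str, int, str]]

    @property
    def min_delete_key(self) -> int:
        return min(e[1] for e in self.entries)

    @property
    def max_delete_key(self) -> int:
        return max(e[1] for e in self.entries)


def range_delete_on_delete_key(
    pages: List[Page], lo: int, hi: int
) -> Tuple[List[Page], int]:
    """Delete entries with lo <= delete_key <= hi.

    Returns the surviving pages and the number of pages dropped whole.
    """
    kept: List[Page] = []
    dropped = 0
    for page in pages:
        if lo <= page.min_delete_key and page.max_delete_key <= hi:
            dropped += 1       # fully covered: drop the page without rewriting
        elif page.max_delete_key < lo or page.min_delete_key > hi:
            kept.append(page)  # disjoint from the range: keep untouched
        else:
            # Partial overlap: only this page has to be filtered and rewritten.
            kept.append(
                Page([e for e in page.entries if not lo <= e[1] <= hi])
            )
    return kept, dropped
```

The narrower the delete-key range of each page, the more pages fall into the "fully covered" case and the fewer need rewriting, which is the intuition behind a layout that takes the delete key order into account. The cost of organizing pages this way against the sort-key order is a trade-off the sketch does not model.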
