From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s Locality

ACM Transactions on Storage (TOS), Pub Date: 2022-06-02, DOI: 10.1145/3507921

Xiangyu Zou, Jingsong Yuan, Philip Shilane, Wen Xia, Haijun Zhang, Xuan Wang



Abstract

Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. This results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and this leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or SSD, but fragmentation continues to lower restore and GC performance. Investigating the locality issue, we design a method to flatten the hyper-dimensional structured deduplicated data to a one-dimensional format, which is based on classification of each chunk’s lifecycle, and this creates our proposed data layout. Furthermore, we present a novel management-friendly deduplication framework, called MFDedup, that applies our data layout and maintains locality as much as possible. Specifically, we use two key techniques in MFDedup: Neighbor-duplicate-focus indexing (NDF) and Across-version-aware Reorganization scheme (AVAR). NDF performs duplicate detection against a previous backup, then AVAR rearranges chunks with an offline and iterative algorithm into a compact, sequential layout, which nearly eliminates random I/O during file restores after deduplication. Evaluation results with five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster due to the improved data layout. While the rearranging stage introduces overheads, it is more than offset by a nearly-zero overhead GC process. Moreover, the NDF index only requires indices for two backup versions, while the traditional index grows with the number of versions retained.
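The two ideas named in the abstract, detecting duplicates only against the previous backup (NDF) and classifying chunks by lifecycle, can be illustrated with a small sketch. This is not the authors' implementation: the function names `ndf_backup` and `group_by_lifecycle`, the use of SHA-1 fingerprints, and the representation of a lifecycle as a (first version, last version) pair are all assumptions made for illustration. The sketch only shows why, when each version is compared with its immediate predecessor alone, every stored chunk copy is referenced by a contiguous run of versions, which is what makes a linear grouping possible.

```python
import hashlib
from collections import defaultdict


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-1 as the chunk fingerprint; the abstract does not name one.
    return hashlib.sha1(chunk).hexdigest()


def ndf_backup(versions):
    """Deduplicate each backup version only against the version before it.

    Returns a dict mapping a stored copy id to [first_version, last_version].
    Only two per-version indexes (previous and current) exist at any time.
    """
    prev_index = {}   # fingerprint -> copy id, for the previous version only
    lifecycles = {}   # copy id -> [first_version, last_version]
    next_id = 0
    for v, chunks in enumerate(versions):
        cur_index = {}
        for chunk in chunks:
            fp = fingerprint(chunk)
            if fp in cur_index:
                continue  # already handled within this version
            if fp in prev_index:
                # Duplicate of a chunk in the neighboring (previous) version:
                # reuse the stored copy and extend its lifecycle.
                cid = prev_index[fp]
                lifecycles[cid][1] = v
            else:
                # Not in the previous version: store a new copy, even if an
                # older version once held the same content.
                cid = next_id
                next_id += 1
                lifecycles[cid] = [v, v]
            cur_index[fp] = cid
        prev_index = cur_index  # older indexes are discarded
    return lifecycles


def group_by_lifecycle(lifecycles):
    """Group stored copies by their (first_version, last_version) lifecycle."""
    groups = defaultdict(list)
    for cid, (first, last) in lifecycles.items():
        groups[(first, last)].append(cid)
    return dict(groups)


# Hypothetical three-version backup; each byte string stands for one chunk.
versions = [[b"A", b"B"], [b"A", b"C"], [b"A", b"B"]]
groups = group_by_lifecycle(ndf_backup(versions))
print(groups)
```

In this toy input, chunk `B` appears in versions 0 and 2 but not in version 1, so it is stored twice with two separate lifecycles. That is the trade-off of comparing against the previous version only: a little deduplication may be lost, while each group of copies maps to a contiguous range of versions that can be laid out sequentially and dropped as a unit when its last version expires.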


