InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization

IF 2.6 | Zone 3, Computer Science | Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | ACM Transactions on Storage | Pub Date: 2023-01-11 | DOI: https://dl.acm.org/doi/10.1145/3568426

Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu


Citations: 0

Abstract

Inline deduplication removes redundant data in real time as data is sent to the storage system. However, it causes data fragmentation: after deduplication, logically consecutive chunks are physically scattered across many containers. Many rewrite algorithms aim to alleviate the resulting performance degradation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms decide whether a chunk is fragmented using a simple preset fixed value, ignoring how data characteristics vary between data segments. Accordingly, they often fail to select an appropriate set of old containers for rewriting, so that restoring a backup retrieves containers holding a substantial number of invalid chunks.
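To make the fixed-value problem concrete, the sketch below shows a fixed-cap rewrite policy in the spirit of Capping. It is an illustration only: the function name `fixed_cap_rewrite`, the segment layout (a list of `(chunk_id, container_id)` pairs with `None` marking a unique chunk), and the tie-breaking rule are assumptions, not details taken from the paper.

```python
from collections import Counter

def fixed_cap_rewrite(segment, cap):
    """Return the duplicate chunks a fixed-cap policy would rewrite.

    segment: list of (chunk_id, container_id); container_id is None for
             a unique chunk (hypothetical layout, assumed for this sketch).
    cap:     preset maximum number of old containers a segment may reference.
    """
    # Count how many chunks of this segment each old container serves.
    refs = Counter(cid for _, cid in segment if cid is not None)
    # Keep the `cap` most-referenced containers; ties broken by container id.
    ranked = sorted(refs.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = {cid for cid, _ in ranked[:cap]}
    # Duplicates stored in any other old container are rewritten.
    return [chunk for chunk, cid in segment
            if cid is not None and cid not in kept]

segment = [("a", 1), ("b", 1), ("c", 2), ("d", None),
           ("e", 3), ("f", 1), ("g", 2)]
print(fixed_cap_rewrite(segment, cap=2))  # ['e']
print(fixed_cap_rewrite(segment, cap=1))  # ['c', 'e', 'g']
```

The same `cap` is applied to every segment regardless of how its duplicates are distributed, which is the rigidity the abstract criticizes.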

To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old containers referenced in each segment. InDe fully leverages the distribution of duplicate chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewriting. We design a rewrite algorithm, F-greedy, that detects valid container utilization in order to rewrite low-VCRC containers. Based on the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving restore speed. To take full advantage of these features, we further propose another rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting trends in VCRC change in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves restore speed by 1.3×–2.4× while achieving almost the same backup performance.
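The abstract does not spell out how F-greedy selects containers, so the following is only a hedged sketch of what a per-segment adaptive selection could look like. Two things are assumptions made for illustration: VCRC is approximated as the number of chunks in the segment that reference a container, and the stopping rule (keep containers whose count is at least the segment's mean count) is a hypothetical stand-in for the paper's actual detection rule. The function name `adaptive_rewrite` is likewise invented.

```python
from collections import Counter

def adaptive_rewrite(segment):
    """Greedy per-segment container selection (illustrative, not F-greedy).

    segment: list of (chunk_id, container_id); container_id is None for
             a unique chunk (hypothetical layout, assumed for this sketch).
    Returns (kept_containers, chunks_to_rewrite).
    """
    # Assumed proxy for VCRC: chunks in this segment served by each container.
    vcrc = Counter(cid for _, cid in segment if cid is not None)
    if not vcrc:
        return set(), []
    # Assumed threshold: the segment's own mean, so it adapts per segment.
    mean = sum(vcrc.values()) / len(vcrc)
    kept = set()
    # Greedily take containers from highest count down, stop below the mean.
    for cid, count in sorted(vcrc.items(), key=lambda kv: (-kv[1], kv[0])):
        if count < mean:
            break
        kept.add(cid)
    rewritten = [chunk for chunk, cid in segment
                 if cid is not None and cid not in kept]
    return kept, rewritten

dense = [("a", 1), ("b", 1), ("c", 1), ("d", 2), ("e", 2), ("f", 2)]
sparse = [("a", 1), ("b", 1), ("c", 1), ("d", 1), ("e", 2), ("f", 3)]
print(adaptive_rewrite(dense))   # ({1, 2}, [])
print(adaptive_rewrite(sparse))  # ({1}, ['e', 'f'])
```

In this toy run the number of old containers kept differs between the two segments (two versus one) without any preset cap, which is the kind of per-segment adjustment the abstract describes.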

Source journal

ACM Transactions on Storage (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore

4.20

Self-citation rate

5.90%

Articles published

Review time

>12 weeks

Journal description: The ACM Transactions on Storage (TOS) is a new journal with an intent to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.