InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization

ACM Transactions on Storage. Pub Date: 2022-11-19. DOI: 10.1145/3568426. IF 2.6; Zone 3 (Computer Science); JCR Q3 (Computer Science, Hardware & Architecture).

Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu

Publication type: Journal Article. Pages: 1–27. Open access: no.

Citations: 4

Abstract

Inline deduplication removes redundant data in real time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after data deduplication. Many rewrite algorithms aim to alleviate the performance degradation due to fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. Accordingly, they often fail to select an appropriate set of old containers for rewrite, so that when backups are restored, the retrieved containers hold a substantial number of invalid chunks. To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for the rewrite. We design a rewrite algorithm, F-greedy, that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm called F-greedy+ based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimation of the valid utilization of old containers by detecting trends in VCRC changes in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.
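The contrast the abstract draws between a fixed threshold and an adaptive, per-segment choice of old containers can be illustrated with a minimal sketch. It is not the authors' F-greedy or F-greedy+ algorithm, and the abstract does not give the exact definition of VCRC; here a plain per-segment reference count stands in for it. All function names, the `min_share` parameter, and the cutoff rule are assumptions made for illustration only.

```python
from collections import Counter


def container_ref_counts(segment):
    """Count, per old container, how many of the segment's duplicate chunks it holds.

    `segment` is a list of (fingerprint, container_id) pairs; container_id is
    None for a chunk that has no stored copy yet (a unique chunk).
    """
    return Counter(cid for _, cid in segment if cid is not None)


def _ranked(counts):
    # Most-referenced containers first; ties broken by container id for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def select_fixed_cap(counts, cap):
    """Fixed-threshold baseline: keep the `cap` most-referenced containers."""
    return {cid for cid, _ in _ranked(counts)[:cap]}


def select_adaptive(counts, min_share=0.5):
    """Hypothetical greedy cutoff (an assumption, not the paper's rule): keep
    containers in rank order until the next one contributes less than
    `min_share` of the mean count kept so far."""
    kept, total = [], 0
    for cid, n in _ranked(counts):
        if kept and n < min_share * (total / len(kept)):
            break
        kept.append(cid)
        total += n
    return set(kept)


def plan_segment(segment, keep):
    """Label each chunk: 'write' (unique), 'dedup' (reference a kept container),
    or 'rewrite' (duplicate stored again because its container was dropped)."""
    plan = []
    for fp, cid in segment:
        if cid is None:
            action = "write"
        elif cid in keep:
            action = "dedup"
        else:
            action = "rewrite"
        plan.append((fp, action))
    return plan
```

For a segment with five duplicate chunks in container A, four in B, and one in C, the adaptive rule in this sketch keeps A and B and rewrites the single chunk from C, whereas a fixed cap of 1 would also rewrite the four chunks from B and a fixed cap of 3 would keep the poorly used container C.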

Source journal

ACM Transactions on Storage (Computer Science, Hardware & Architecture; Computer Science, Software Engineering)

CiteScore: 4.20

Self-citation rate: 5.90%

Review time: >12 weeks

About the journal: The ACM Transactions on Storage (TOS) is a new journal that intends to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.