{"title":"InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization","authors":"Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu","doi":"https://dl.acm.org/doi/10.1145/3568426","DOIUrl":null,"url":null,"abstract":"<p>Inline deduplication removes redundant data in real-time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after data deduplication. Many rewrite algorithms aim to alleviate the performance degradation due to fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. Accordingly, when backups are restored, they often fail to select an appropriate set of old containers for rewrite, generating a substantial number of invalid chunks in retrieved containers. </p><p>To address this issue, we propose an inline deduplication approach for storage systems, called <i>InDe</i>, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, <i>valid container referenced counts (VCRC)</i>, to identify appropriate containers for the rewrite. We design a rewrite algorithm <i>F-greedy</i> that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references to only share duplicate chunks with high-utilization containers for each segment, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm called <i>F-greedy+</i> based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimation of the valid utilization of old containers by detecting trends of VCRC’s change in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"58 8","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Storage","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3568426","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Inline deduplication removes redundant data in real-time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after data deduplication. Many rewrite algorithms aim to alleviate the performance degradation due to fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. Accordingly, when backups are restored, they often fail to select an appropriate set of old containers for rewrite, generating a substantial number of invalid chunks in retrieved containers.
To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for the rewrite. We design a rewrite algorithm F-greedy that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references to only share duplicate chunks with high-utilization containers for each segment, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm called F-greedy+ based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimation of the valid utilization of old containers by detecting trends of VCRC’s change in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.
期刊介绍:
The ACM Transactions on Storage (TOS) is a new journal with an intent to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises of network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.