Estimation of deduplication ratios in large data sets

Danny Harnik, Oded Margalit, D. Naor, D. Sotnikov, G. Vernik
{"title":"Estimation of deduplication ratios in large data sets","authors":"Danny Harnik, Oded Margalit, D. Naor, D. Sotnikov, G. Vernik","doi":"10.1109/MSST.2012.6232381","DOIUrl":null,"url":null,"abstract":"We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task - It has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to come up with a accurate estimation when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient, yet accurate, method. Efficiency in this case refers to the demanding CPU, memory and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phased framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature and evaluate our technique on a number of real world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with accuracy guarantees of at most a 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads we used in this work we achieved accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether to dedupe or not dedupe and conducting large-scale academic studies related to deduplication ratios.","PeriodicalId":348234,"journal":{"name":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"01 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSST.2012.6232381","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 41

Abstract

We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task: it has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to come up with an accurate estimate when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient, yet accurate, method. Efficiency in this case refers to the demanding CPU, memory, and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phase framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid the overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature, and evaluate our technique on a number of real-world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with an accuracy guarantee of at most 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads used in this work, we achieved an accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether or not to dedupe, and conducting large-scale academic studies related to deduplication ratios.
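The abstract leaves the algorithm itself to the paper, but the core two-phase idea can be illustrated with a minimal sketch. The Python code below is an illustrative assumption, not the authors' exact method: it assumes fixed-size 4 KB chunking, SHA-1 fingerprints, and plain reservoir sampling, and all names (`chunk_hashes`, `estimate_dedup_ratio`) are invented for this example. Phase one draws a uniform sample of chunk instances from a full scan; phase two re-scans the data and counts how many times each sampled fingerprint occurs in the whole data set, from which the deduplication ratio is estimated.

```python
import hashlib
import random
from collections import Counter

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; real systems may chunk differently

def chunk_hashes(paths):
    """Stream the fingerprint of every chunk instance in the data set."""
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                yield hashlib.sha1(chunk).digest()

def estimate_dedup_ratio(paths, sample_size=10_000, seed=0):
    rng = random.Random(seed)

    # Phase 1: reservoir-sample `sample_size` chunk *instances*
    # uniformly at random from the stream of all chunk fingerprints.
    reservoir = []
    total = 0
    for h in chunk_hashes(paths):
        total += 1
        if len(reservoir) < sample_size:
            reservoir.append(h)
        else:
            j = rng.randrange(total)
            if j < sample_size:
                reservoir[j] = h
    if not reservoir:
        raise ValueError("empty data set")

    # Phase 2: re-scan and count, for each sampled fingerprint,
    # how many times it occurs in the entire data set.
    wanted = set(reservoir)
    counts = Counter()
    for h in chunk_hashes(paths):
        if h in wanted:
            counts[h] += 1

    # A sampled instance whose chunk appears d times contributes 1/d;
    # averaging over a uniform sample of instances gives an unbiased
    # estimate of (#distinct chunks) / (#total chunks).
    return sum(1.0 / counts[h] for h in reservoir) / len(reservoir)
```

The point of the two-phase structure is that memory stays proportional to the sample size rather than to the number of distinct chunks, so no large deduplication table is ever built; the price is a second sequential pass over the data. This matches the flavor of the paper's headline result (megabytes of RAM for a multi-terabyte scan), though the paper's actual estimator and its accuracy guarantees should be taken from the paper itself.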