{"title":"Characterizing the efficiency of data deduplication for big data storage management","authors":"Ruijin Zhou, Ming Liu, Tao Li","doi":"10.1109/IISWC.2013.6704674","DOIUrl":null,"url":null,"abstract":"The demand for data storage and processing is increasing at a rapid speed in the big data era. Such a tremendous amount of data pushes the limit on storage capacity and on the storage network. A significant portion of the dataset in big data workloads is redundant. As a result, deduplication technology, which removes replicas, becomes an attractive solution to save disk space and traffic in a big data environment. However, the overhead of extra CPU computation (hash indexing) and IO latency introduced by deduplication should be considered. Therefore, the net effect of using deduplication for big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication. We analyze and characterize the performance and energy impact brought by deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate the deduplication efficiency in an SSD environment for big data workloads.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Workload Characterization (IISWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2013.6704674","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 39
Abstract
The demand for data storage and processing is growing rapidly in the big data era. This tremendous volume of data pushes the limits of both storage capacity and the storage network. A significant portion of the dataset in big data workloads is redundant. Consequently, deduplication technology, which removes duplicate copies of data, becomes an attractive way to save disk space and network traffic in big data environments. However, the extra CPU computation (hash indexing) and I/O latency that deduplication introduces must also be considered, so the net effect of applying deduplication to big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication, and we analyze the performance and energy impact of deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate deduplication efficiency in an SSD environment for big data workloads.
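As a concrete illustration of the mechanism the abstract refers to, the sketch below shows block-level deduplication via hash indexing: data is split into fixed-size chunks, each chunk is indexed by its hash digest, and only the first copy of each chunk is stored. This is a minimal sketch under assumed parameters (4 KiB chunks, SHA-256 digests); the names dedup_ratio and chunk_size are illustrative and not taken from the paper, which evaluates several deduplication layers, locations, and granularities rather than this one scheme.

    import hashlib
    import os

    def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
        """Split data into fixed-size chunks, index each chunk by its
        SHA-256 digest, and report logical size / unique (stored) size."""
        seen = {}          # digest -> chunk: the "hash index"
        total = unique = 0
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            digest = hashlib.sha256(chunk).digest()  # the extra CPU work
            total += len(chunk)
            if digest not in seen:   # first copy: store it
                seen[digest] = chunk
                unique += len(chunk)
        return total / unique if unique else 1.0

    # Three replicas of one block, mimicking an HDFS-style replication
    # mechanism (the paper's third source of redundancy).
    block = os.urandom(16384)        # one 16 KiB block of unique data
    print(dedup_ratio(block * 3))    # -> 3.0: two of three copies are redundant

The sha256 call on every chunk corresponds to the extra CPU computation the abstract flags as deduplication overhead, and the replicated block in the usage example mimics the redundancy that replication mechanisms introduce.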