Characterizing the efficiency of data deduplication for big data storage management

Ruijin Zhou, Ming Liu, Tao Li
{"title":"Characterizing the efficiency of data deduplication for big data storage management","authors":"Ruijin Zhou, Ming Liu, Tao Li","doi":"10.1109/IISWC.2013.6704674","DOIUrl":null,"url":null,"abstract":"The demand for data storage and processing is increasing at a rapid speed in the big data era. Such a tremendous amount of data pushes the limit on storage capacity and on the storage network. A significant portion of the dataset in big data workloads is redundant. As a result, deduplication technology, which removes replicas, becomes an attractive solution to save disk space and traffic in a big data environment. However, the overhead of extra CPU computation (hash indexing) and IO latency introduced by deduplication should be considered. Therefore, the net effect of using deduplication for big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication. We analyze and characterize the performance and energy impact brought by deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate the deduplication efficiency in an SSD environment for big data workloads.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Workload Characterization (IISWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2013.6704674","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39

Abstract

The demand for data storage and processing is increasing rapidly in the big data era. Such a tremendous amount of data pushes the limits of storage capacity and of the storage network. A significant portion of the dataset in big data workloads is redundant. As a result, deduplication technology, which removes replicas, becomes an attractive solution for saving disk space and traffic in a big data environment. However, the extra CPU computation (hash indexing) and I/O latency introduced by deduplication must also be considered, so the net effect of using deduplication for big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication, and we analyze the performance and energy impact of deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate deduplication efficiency in an SSD environment for big data workloads.
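The hash-indexing step at the heart of deduplication can be made concrete with a minimal sketch. The paper does not publish code, so the Python example below is an illustrative assumption, not the authors' implementation: it performs fixed-size, block-level deduplication by keeping one physical copy per unique SHA-1 digest, which makes visible both the space savings and the extra CPU cost of hashing that the abstract refers to.

```python
# Minimal sketch of block-level deduplication via hash indexing.
# Illustrative only; not the implementation evaluated in the paper.
import hashlib

BLOCK_SIZE = 4096  # fixed-size chunking; real systems often use content-defined chunks

def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and store one copy per unique block.

    Returns the unique-block store and the ordered hash "recipe" needed to
    reconstruct the original byte stream.
    """
    store = {}   # digest -> block bytes (one physical copy per unique block)
    recipe = []  # ordered digests describing the logical data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha1(block).hexdigest()  # the hash-indexing CPU cost
        store.setdefault(digest, block)           # drop the replica if seen before
        recipe.append(digest)
    return store, recipe

def reconstruct(store: dict, recipe: list) -> bytes:
    """Rebuild the logical data from the recipe (the read-path indirection)."""
    return b"".join(store[digest] for digest in recipe)

if __name__ == "__main__":
    # Synthetic redundant data: 8 logical blocks, only 2 unique.
    payload = (b"a" * BLOCK_SIZE) * 5 + (b"b" * BLOCK_SIZE) * 3
    store, recipe = deduplicate(payload)
    assert reconstruct(store, recipe) == payload
    logical = len(payload)
    physical = sum(len(b) for b in store.values())
    print(f"dedup ratio: {logical / physical:.1f}x "
          f"({len(recipe)} logical blocks, {len(store)} unique)")
```

Even this toy version exposes the trade-off the paper studies: every block, duplicate or not, pays the hashing cost, and every read pays an extra level of indirection, while the savings depend entirely on how much redundancy the workload actually contains. Production systems replace the in-memory dictionary with an on-disk index and typically use stronger digests (e.g., SHA-256) to rule out collisions.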