Frequency Based Chunking for Data De-Duplication

Guanlin Lu, Yu Jin, D. Du
{"title":"Frequency Based Chunking for Data De-Duplication","authors":"Guanlin Lu, Yu Jin, D. Du","doi":"10.1109/MASCOTS.2010.37","DOIUrl":null,"url":null,"abstract":"A predominant portion of Internet services, like content delivery networks, news broadcasting, blogs sharing and social networks, etc., is data centric. A significant amount of new data is generated by these services each day. To efficiently store and maintain backups for such data is a challenging task for current data storage systems. Chunking based deduplication (dedup) methods are widely used to eliminate redundant data and hence reduce the required total storage space. In this paper, we propose a novel Frequency Based Chunking (FBC) algorithm. Unlike the most popular Content-Defined Chunking (CDC) algorithm which divides the data stream randomly according to the content, FBC explicitly utilizes the chunk frequency information in the data stream to enhance the data deduplication gain especially when the metadata overhead is taken into consideration. The FBC algorithm consists of two components, a statistical chunk frequency estimation algorithm for identifying the globally appeared frequent chunks, and a two-stage chunking algorithm which uses these chunk frequencies to obtain a better chunking result. To evaluate the effectiveness of the proposed FBC algorithm, we conducted extensive experiments on heterogeneous datasets. In all experiments, the FBC algorithm persistently outperforms the CDC algorithm in terms of achieving a better dedup gain or producing much less number of chunks. Particularly, our experiments show that FBC produces 2.5 ~ 4 times less number of chunks than that of a baseline CDC which achieving the same Duplicate Elimination Ratio (DER). Another benefit of FBC over CDC is that the FBC with average chunk size greater than or equal to that of CDC achieves up to 50% higher DER than that of a CDC algorithm.","PeriodicalId":406889,"journal":{"name":"2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"80","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MASCOTS.2010.37","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 80

Abstract

A predominant portion of Internet services, such as content delivery networks, news broadcasting, blog sharing, and social networks, is data centric. These services generate a significant amount of new data each day, and efficiently storing and maintaining backups of such data is a challenging task for current data storage systems. Chunking-based deduplication (dedup) methods are widely used to eliminate redundant data and hence reduce the required total storage space. In this paper, we propose a novel Frequency-Based Chunking (FBC) algorithm. Unlike the popular Content-Defined Chunking (CDC) algorithm, which divides the data stream at positions determined pseudo-randomly by the content, FBC explicitly utilizes chunk frequency information in the data stream to enhance the deduplication gain, especially when metadata overhead is taken into consideration. The FBC algorithm consists of two components: a statistical chunk frequency estimation algorithm for identifying globally frequent chunks, and a two-stage chunking algorithm that uses these chunk frequencies to obtain a better chunking result. To evaluate the effectiveness of the proposed FBC algorithm, we conducted extensive experiments on heterogeneous datasets. In all experiments, the FBC algorithm consistently outperforms the CDC algorithm, either achieving a better dedup gain or producing far fewer chunks. In particular, our experiments show that FBC produces 2.5 to 4 times fewer chunks than a baseline CDC algorithm while achieving the same Duplicate Elimination Ratio (DER). Another benefit of FBC over CDC is that, with an average chunk size greater than or equal to that of CDC, FBC achieves up to 50% higher DER than the CDC algorithm.
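The abstract outlines the pipeline without implementation detail, so the following toy Python sketch illustrates the general idea only. Everything in it is an assumption made for illustration: a simple polynomial rolling hash stands in for Rabin fingerprinting in the CDC baseline, an exact Counter stands in for the paper's statistical frequency estimator, and WINDOW, MASK, seg, and min_freq are invented parameters. It is not the authors' implementation.

```python
# Toy sketch of the two ideas named in the abstract: a baseline
# content-defined chunker (CDC) and a frequency-guided two-stage pass
# (FBC-flavoured). Illustrative reconstruction only.
from collections import Counter

WINDOW = 16    # rolling-hash window in bytes (assumed)
MASK = 0x1FFF  # ~8 KiB expected average chunk size (assumed)
MAGIC = 0      # a boundary is declared when (hash & MASK) == MAGIC

def rolling_hashes(data: bytes):
    """Yield (offset, hash) per window position using a simple
    polynomial rolling hash (stand-in for Rabin fingerprints)."""
    B, MOD = 257, (1 << 61) - 1
    if len(data) < WINDOW:
        return
    h = 0
    for b in data[:WINDOW]:
        h = (h * B + b) % MOD
    yield WINDOW - 1, h
    pow_top = pow(B, WINDOW - 1, MOD)
    for i in range(WINDOW, len(data)):
        # slide the window: drop the oldest byte, append the new one
        h = ((h - data[i - WINDOW] * pow_top) * B + data[i]) % MOD
        yield i, h

def cdc_chunks(data: bytes):
    """Baseline CDC: cut wherever the rolling hash matches the mask."""
    chunks, start = [], 0
    for i, h in rolling_hashes(data):
        if (h & MASK) == MAGIC:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fbc_chunks(data: bytes, seg: int = 1024, min_freq: int = 2):
    """FBC-flavoured two-stage pass: estimate how often fixed-size
    segments repeat, then re-cut coarse CDC chunks so that frequent
    segments become standalone chunks."""
    coarse = cdc_chunks(data)
    # Stage 1: frequency estimation. An exact Counter is used here;
    # the paper uses a space-efficient statistical estimator instead.
    freq = Counter()
    for chunk in coarse:
        for i in range(0, len(chunk) - seg + 1, seg):
            freq[chunk[i:i + seg]] += 1
    # Stage 2: isolate frequent segments at chunk boundaries.
    out = []
    for chunk in coarse:
        start = 0
        for i in range(0, len(chunk) - seg + 1, seg):
            if freq[chunk[i:i + seg]] >= min_freq:
                if i > start:
                    out.append(chunk[start:i])  # prefix before the frequent segment
                out.append(chunk[i:i + seg])    # the frequent segment itself
                start = i + seg
        if start < len(chunk):
            out.append(chunk[start:])
    return out

def der(chunks):
    """Duplicate Elimination Ratio: input bytes / unique-chunk bytes.
    Per-chunk metadata overhead is ignored in this toy version."""
    total = sum(len(c) for c in chunks)
    unique = sum(len(c) for c in set(chunks))
    return total / unique if unique else 1.0

if __name__ == "__main__":
    import os
    base = os.urandom(8192)
    data = base * 4 + os.urandom(4096) + base * 4  # repeated content to deduplicate
    print("CDC DER:", round(der(cdc_chunks(data)), 2))
    print("FBC DER:", round(der(fbc_chunks(data, seg=512)), 2))
```

The exact Counter and the metadata-free DER are the two biggest simplifications: the paper's frequency estimator is statistical precisely so that it scales to large streams, and its dedup gain accounts for the metadata cost that grows with the number of chunks.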