A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
arXiv - CS - Distributed, Parallel, and Cluster Computing, published 2024-09-09. DOI: arxiv-2409.06066 (https://doi.org/arxiv-2409.06066)

Abstract

Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite these advancements, the current state of the art remains unclear, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and an impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. In many instances, our analyses extend the findings of the algorithms' original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
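
To make the four metrics concrete, the following is a minimal Python sketch and not the authors' implementation. It uses a simple Gear-style rolling hash as a stand-in chunker. The names `chunk`, `metrics`, and `throughput`, the parameters `MIN_SIZE`, `MAX_SIZE`, and `MASK`, and the seeded random `GEAR` table are all assumptions made for illustration. The deduplication ratio is assumed here to mean the fraction of bytes eliminated; other definitions are in use.

```python
import hashlib
import random
import statistics
import time
from typing import Dict, List

# Hypothetical parameters for illustration; not taken from the paper.
MIN_SIZE = 64          # no boundary is accepted before this many bytes
MAX_SIZE = 1024        # a boundary is forced at this many bytes
MASK = 0xFF << 24      # boundary when the top 8 bits of the hash are zero

# Gear table: one pseudo-random 32-bit value per byte value (seeded, so
# chunk boundaries are reproducible across runs).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> List[bytes]:
    """Split data into content-defined chunks with a Gear-style rolling hash."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Each shift pushes older bytes out of the 32-bit state, so the hash
        # depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than MIN_SIZE
    return chunks


def metrics(chunks: List[bytes]) -> Dict[str, float]:
    """Compute deduplication ratio, average chunk size, and chunk-size variance."""
    if not chunks:
        raise ValueError("no chunks to evaluate")
    sizes = [len(c) for c in chunks]
    total = sum(sizes)
    # Identify duplicate chunks by fingerprint; each unique chunk is stored once.
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}
    stored = sum(unique.values())
    return {
        "dedup_ratio": 1.0 - stored / total,  # assumed: fraction of bytes eliminated
        "avg_chunk_size": total / len(chunks),
        "chunk_size_variance": statistics.pvariance(sizes),
    }


def throughput(data: bytes) -> float:
    """Return chunking throughput in bytes per second for a single run."""
    t0 = time.perf_counter()
    chunk(data)
    elapsed = max(time.perf_counter() - t0, 1e-9)
    return len(data) / elapsed
```

Calling `metrics(chunk(data))` on a byte string returns the three size-related metrics, and `throughput(data)` gives a rough single-run speed figure. A pure-Python loop is far slower than the optimized implementations a real benchmark would use, so the throughput number is only meaningful for comparing variants of this sketch against each other.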