A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2024-09-09. DOI: arxiv-2409.06066

Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse


Abstract

Data deduplication emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state-of-the-art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
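To make the terms in the abstract concrete, here is a minimal sketch of a content-defined chunker and three of the four metrics it names (throughput is left out). It is not one of the algorithms evaluated in the paper: the Gear-style rolling hash, the names `chunk` and `metrics`, and the values of `MASK`, `MIN_SIZE`, and `MAX_SIZE` are assumptions chosen only for illustration. The deduplication ratio is taken here as total bytes divided by unique bytes, which is one of several conventions in use.

```python
import hashlib
import random
import statistics

# Hypothetical parameters for illustration only; not taken from the paper.
MASK = (1 << 6) - 1   # cut when the low 6 hash bits are zero
MIN_SIZE = 16         # never cut before this many bytes
MAX_SIZE = 256        # always cut at this many bytes

# Fixed pseudo-random table: one 32-bit value per byte value (Gear-style hash).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries chosen from its content by a rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shift-and-add update: older bytes drift out of the low bits.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than MIN_SIZE
    return chunks


def metrics(chunks: list[bytes]) -> dict[str, float]:
    """Deduplication ratio, mean chunk size, and chunk-size variance."""
    sizes = [len(c) for c in chunks]
    # Identical chunks share a SHA-256 digest and are stored only once.
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}
    return {
        "dedup_ratio": sum(sizes) / sum(unique.values()),
        "mean_size": statistics.mean(sizes),
        "size_variance": statistics.pvariance(sizes),
    }
```

Calling `metrics(chunk(data))` on a byte string returns the three values in a dictionary. Because the cut decision depends only on the most recent bytes, an insertion early in a file tends to shift boundaries only locally, which is the property that distinguishes content-defined chunking from fixed-size chunking.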
