A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication
Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse
arXiv:2409.06066 · arXiv - CS - Distributed, Parallel, and Cluster Computing · 2024-09-09
Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite these advances, the current state of the art remains obscure, because a thorough and impartial analysis and comparison of these algorithms has been lacking. We conduct a rigorous theoretical analysis and an impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate the algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. In many instances, our analyses extend the findings of the algorithms' original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
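
To make the abstract's core idea concrete, here is a minimal sketch of content-defined chunking, assuming a Gear-style rolling hash. The GEAR table, the 8 KiB target mask, and the minimum/maximum chunk bounds are illustrative choices on our part, not the parameters of any specific algorithm evaluated in the paper.

```python
# Minimal Gear-style content-defined chunker (illustrative sketch).
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random value per byte
MASK = (1 << 13) - 1          # expected chunk size around 8 KiB (2**13)
MIN_SIZE, MAX_SIZE = 2048, 65536

def chunk(data: bytes):
    """Yield chunks whose boundaries are chosen by content, not fixed offsets."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        # Rolling Gear hash, kept within 64 bits.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if length < MIN_SIZE:
            continue
        # Cut when the low hash bits are all zero, or the chunk grows too long.
        if (h & MASK) == 0 or length >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]   # residual tail chunk
```

Because the cut condition depends only on the bytes seen since the last boundary, an insertion or deletion perturbs boundaries only near the edit; fixed-size chunking, by contrast, shifts every subsequent boundary, which is why content-defined boundaries deduplicate better across file versions.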
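The four metrics named in the abstract can likewise be sketched in a few lines. The fingerprint function (SHA-256) and the orientation of the deduplication ratio (stored bytes divided by original bytes; some papers report the inverse) are assumptions on our part, not definitions taken from the paper.

```python
# Computing throughput, deduplication ratio, average chunk size, and
# chunk-size variance from one chunking run (illustrative sketch).
import hashlib
import statistics
import time

def evaluate(data: bytes) -> dict:
    t0 = time.perf_counter()
    chunks = list(chunk(data))          # chunker from the sketch above
    elapsed = time.perf_counter() - t0

    sizes = [len(c) for c in chunks]
    # Identical chunks collapse to one stored copy, keyed by fingerprint.
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}

    return {
        "throughput_MBps": len(data) / elapsed / 1e6,
        "dedup_ratio": sum(unique.values()) / len(data),  # stored / original
        "avg_chunk_size": statistics.mean(sizes),
        "chunk_size_variance": statistics.pvariance(sizes),
    }

# Example: input with internal redundancy yields a ratio well below 1.
print(evaluate(b"some repeated content " * 100_000))
```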