Block Size Optimization in Deduplication Systems

2009 Data Compression Conference. Pub Date: 2009-03-16. DOI: 10.1109/DCC.2009.51

C. Constantinescu, J. Pieper, Tiancheng Li


Citations: 9

Abstract




Data deduplication is a popular dictionary-based compression method in storage archival and backup. Deduplication efficiency ("chunk" matching) improves with smaller chunk sizes; however, the files then become highly fragmented, requiring many disk accesses during reconstruction, or causing "chattiness" in a client-server architecture. Within the sequence of chunks that an object (file) is decomposed into, sub-sequences of adjacent chunks tend to repeat. We exploit this insight to optimize the chunk sizes by joining repeated sub-sequences of small chunks into new "super chunks", under the constraint of achieving practically the same matching performance. We employ suffix arrays to find these repeating sub-sequences and to determine a new encoding that covers the original sequence. With super chunks we significantly reduce fragmentation, improving reconstruction time and the overall deduplication ratio by lowering the amount of metadata (fewer hashes and dictionary entries).
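The idea of joining repeated runs of small chunks can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each chunk is already represented by an integer ID (for example, an index into a hash dictionary), builds a naive suffix array over that ID sequence, finds the single longest repeated sub-sequence, and replaces its non-overlapping occurrences with one hypothetical super-chunk ID. The function names and the `super_id` parameter are illustrative, and the abstract's constraint of preserving matching performance is not modeled.

```python
from typing import List, Sequence, Tuple


def suffix_array(seq: Sequence[int]) -> List[int]:
    """Naive suffix array: start indices sorted by the suffix they begin.

    Quadratic-or-worse cost; fine for a sketch, not for real backup streams.
    """
    return sorted(range(len(seq)), key=lambda i: tuple(seq[i:]))


def longest_repeat(seq: Sequence[int]) -> Tuple[int, ...]:
    """Return the longest sub-sequence of chunk IDs that occurs at least twice.

    Repeats show up as common prefixes of neighbouring suffixes in the
    suffix array, so only adjacent pairs need to be compared.
    """
    sa = suffix_array(seq)
    best: Tuple[int, ...] = ()
    for a, b in zip(sa, sa[1:]):
        k = 0
        while a + k < len(seq) and b + k < len(seq) and seq[a + k] == seq[b + k]:
            k += 1
        if k > len(best):
            best = tuple(seq[a:a + k])
    return best


def merge_super_chunk(seq: Sequence[int], pattern: Tuple[int, ...], super_id: int) -> List[int]:
    """Greedily replace non-overlapping occurrences of `pattern` with `super_id`."""
    out: List[int] = []
    m = len(pattern)
    i = 0
    while i < len(seq):
        # Only merge runs of two or more chunks; a single chunk is already atomic.
        if m > 1 and tuple(seq[i:i + m]) == pattern:
            out.append(super_id)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out


if __name__ == "__main__":
    chunk_ids = [1, 2, 3, 4, 9, 1, 2, 3, 4, 7]  # hypothetical chunk-ID sequence
    run = longest_repeat(chunk_ids)
    print(run, merge_super_chunk(chunk_ids, run, super_id=100))
```

In this toy sequence the run of four small chunks collapses into one super chunk per occurrence, so the encoded sequence needs fewer entries. A practical system would repeat the search for further runs and use a linear-time suffix array construction.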
