{"title":"MAD2:用于网络备份服务的可扩展高吞吐量精确重复数据删除方法","authors":"Jiansheng Wei, Hong Jiang, Ke Zhou, D. Feng","doi":"10.1109/MSST.2010.5496987","DOIUrl":null,"url":null,"abstract":"Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, there are two challenges facing scalable high-throughput deduplication storage. The first is the duplicate-lookup disk bottleneck due to the large size of data index that usually exceeds the available RAM space, which limits the deduplication throughput. The second is the storage node island effect resulting from duplicate data among multiple storage nodes that are difficult to eliminate. Existing approaches fail to completely eliminate the duplicates while simultaneously addressing the challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data both at the file level and at the chunk level by employing four techniques to accelerate the deduplication process and evenly distribute data. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality in backups. Second, MAD2 uses Bloom Filter Array (BFA) as a quick index to quickly identify non-duplicate incoming data objects or indicate where to find a possible duplicate. Third, Dual Cache is integrated in MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences to further enhance performance with a well-balanced load. We evaluate our MAD2 approach on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms the state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100MB/s for each storage component.","PeriodicalId":350968,"journal":{"name":"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"98","resultStr":"{\"title\":\"MAD2: A scalable high-throughput exact deduplication approach for network backup services\",\"authors\":\"Jiansheng Wei, Hong Jiang, Ke Zhou, D. Feng\",\"doi\":\"10.1109/MSST.2010.5496987\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, there are two challenges facing scalable high-throughput deduplication storage. The first is the duplicate-lookup disk bottleneck due to the large size of data index that usually exceeds the available RAM space, which limits the deduplication throughput. The second is the storage node island effect resulting from duplicate data among multiple storage nodes that are difficult to eliminate. Existing approaches fail to completely eliminate the duplicates while simultaneously addressing the challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data both at the file level and at the chunk level by employing four techniques to accelerate the deduplication process and evenly distribute data. 
First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality in backups. Second, MAD2 uses Bloom Filter Array (BFA) as a quick index to quickly identify non-duplicate incoming data objects or indicate where to find a possible duplicate. Third, Dual Cache is integrated in MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based Load-Balance technique to evenly distribute data objects among multiple storage nodes in their backup sequences to further enhance performance with a well-balanced load. We evaluate our MAD2 approach on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms the state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100MB/s for each storage component.\",\"PeriodicalId\":350968,\"journal\":{\"name\":\"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-05-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"98\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MSST.2010.5496987\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSST.2010.5496987","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MAD2: A scalable high-throughput exact deduplication approach for network backup services
Abstract: Deduplication has been widely used in disk-based secondary storage systems to improve space efficiency. However, scalable high-throughput deduplication storage faces two challenges. The first is the duplicate-lookup disk bottleneck: the data index is typically too large to fit in the available RAM, which limits deduplication throughput. The second is the storage node island effect, in which duplicate data spread across multiple storage nodes is difficult to eliminate. Existing approaches fail to completely eliminate duplicates while simultaneously addressing both challenges. This paper proposes MAD2, a scalable high-throughput exact deduplication approach for network backup services. MAD2 eliminates duplicate data at both the file level and the chunk level, employing four techniques to accelerate the deduplication process and distribute data evenly. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows can be used to preserve the data locality of backups. Second, MAD2 uses a Bloom Filter Array (BFA) as a quick index to identify non-duplicate incoming data objects or to indicate where a possible duplicate may be found. Third, a Dual Cache is integrated into MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based load-balancing technique to distribute data objects evenly among multiple storage nodes, in their backup sequences, to further enhance performance with a well-balanced load. We evaluate MAD2 on the backend storage of B-Cloud, a research-oriented distributed system that provides network backup services. Experimental results show that MAD2 significantly outperforms state-of-the-art approximate deduplication approaches in terms of deduplication efficiency, supporting a deduplication throughput of at least 100 MB/s per storage component.
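Two of the mechanisms named in the abstract lend themselves to a compact illustration: a Bloom Filter Array used as an in-RAM quick index over fingerprint partitions, and DHT-style placement that routes each fingerprint to a storage node. The sketch below is not the authors' implementation; it is a minimal Python rendition under assumed names and parameters (BloomFilter, DedupNode, ConsistentHashRing, 16 partitions, 64 virtual nodes per node), and it does not model the HBM row layout or the Dual Cache.

```python
# Minimal sketch (not MAD2's actual code) of two ideas from the abstract:
# a Bloom Filter Array (BFA) as a quick in-RAM index over fingerprint
# partitions, and DHT-style placement of fingerprints onto storage nodes
# via consistent hashing. All names and sizing choices are illustrative.

import hashlib
from bisect import bisect_right

class BloomFilter:
    """A small Bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, fingerprint: bytes):
        # Derive k probe positions from the fingerprint itself.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, fingerprint: bytes):
        for p in self._probes(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, fingerprint: bytes) -> bool:
        # False means "definitely new"; True means "possible duplicate,
        # consult the full index" (Bloom filters have no false negatives).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(fingerprint))

class DedupNode:
    """One storage node: a Bloom Filter Array, one filter per partition."""
    def __init__(self, name: str, partitions=16):
        self.name = name
        self.bfa = [BloomFilter() for _ in range(partitions)]

    def is_duplicate(self, fingerprint: bytes) -> bool:
        part = fingerprint[0] % len(self.bfa)   # route by fingerprint prefix
        if not self.bfa[part].may_contain(fingerprint):
            self.bfa[part].add(fingerprint)     # definitely new: index it
            return False
        return True  # possible duplicate; a real system verifies on disk

class ConsistentHashRing:
    """DHT-style placement: map fingerprints onto nodes via a hash ring."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            ((self._hash(f"{n.name}#{v}".encode()), n)
             for n in nodes for v in range(vnodes)),
            key=lambda pair: pair[0],
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def node_for(self, fingerprint: bytes) -> DedupNode:
        i = bisect_right(self.keys, self._hash(fingerprint)) % len(self.keys)
        return self.ring[i][1]

if __name__ == "__main__":
    nodes = [DedupNode(f"node{i}") for i in range(4)]
    ring = ConsistentHashRing(nodes)
    fp = hashlib.sha1(b"some chunk contents").digest()
    node = ring.node_for(fp)
    print(node.name, "duplicate?", node.is_duplicate(fp))  # first time: False
    print(node.name, "duplicate?", node.is_duplicate(fp))  # second time: True
```

The property the quick index exploits is that Bloom filters never produce false negatives: a "no" answer lets a new object skip the on-disk index entirely, while a "yes" answer only indicates where a possible duplicate may be found and must still be verified against the full index, a cost the paper's Dual Cache then amortizes by exploiting backup-stream locality.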