LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication
Guangping Xu, Bo Tang, Hongli Lu, Quan Yu, C. Sung
2019 35th Symposium on Mass Storage Systems and Technologies (MSST), May 2019. DOI: 10.1109/MSST.2019.00010
Citations: 6
Abstract
In this paper, we present a learning-based data deduplication algorithm, called LIPA, which uses a reinforcement learning framework to build an adaptive indexing structure. It differs substantially from previous inline chunk-based deduplication methods that address the chunk-lookup disk bottleneck in large-scale backup. Previous methods often require a full chunk index or a sampled chunk index to identify duplicate chunks, which is a critical stage of data deduplication. A full chunk index is hard to fit in RAM, while a sampled chunk index makes the deduplication ratio depend directly on the sampling ratio. Our learning-based method requires only a small memory overhead to store the index, yet achieves a deduplication ratio equal to or better than that of previous methods. In our method, after the data stream is broken into relatively large segments, one or more representative chunk fingerprints are chosen as the feature of each segment. An incoming segment may share a feature with previous segments, so we use a key-value structure to record the relationship between features and segments: each feature maps to a fixed number of segments. Using reinforcement learning, we train scores that represent the similarity of these segments to a feature. For an incoming segment, our method adaptively prefetches a matching segment and its successors into the cache using a multi-armed bandit model. Our experimental results show that our method significantly reduces memory overhead and achieves effective deduplication.
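To make the abstract's feature index and bandit-driven prefetching concrete, the Python sketch below shows one plausible shape for the approach. It is illustrative only and fills in details the abstract leaves open: it assumes min-fingerprint sampling for the segment feature, an epsilon-greedy policy for the multi-armed bandit, and hypothetical parameters (SLOTS_PER_FEATURE, LEARNING_RATE, EPSILON) that are not taken from the paper.

```python
import hashlib
import random
from collections import defaultdict

# Hypothetical parameters, not from the paper.
SLOTS_PER_FEATURE = 4   # a feature maps to a fixed number of segments
LEARNING_RATE = 0.5     # step size for score updates
EPSILON = 0.1           # exploration rate for the bandit policy

def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1, as in many dedup systems)."""
    return hashlib.sha1(chunk).hexdigest()

def segment_feature(chunk_fps):
    """Pick a representative fingerprint as the segment's feature.
    Taking the minimum fingerprint (min-hash style sampling) is one
    common choice; the paper may use a different selection rule."""
    return min(chunk_fps)

class FeatureIndex:
    """Key-value index: feature -> fixed-size list of [segment_id, score]."""
    def __init__(self):
        self.table = defaultdict(list)

    def select(self, feature):
        """Epsilon-greedy choice among segments sharing this feature."""
        arms = self.table.get(feature)
        if not arms:
            return None
        if random.random() < EPSILON:
            return random.choice(arms)          # explore
        return max(arms, key=lambda a: a[1])    # exploit best score

    def reward(self, feature, segment_id, hits, total):
        """Reinforce the chosen segment with the observed duplicate-hit ratio."""
        r = hits / total if total else 0.0
        for arm in self.table[feature]:
            if arm[0] == segment_id:
                arm[1] += LEARNING_RATE * (r - arm[1])

    def insert(self, feature, segment_id):
        """Record a new segment under a feature, evicting the worst arm if full."""
        arms = self.table[feature]
        if len(arms) < SLOTS_PER_FEATURE:
            arms.append([segment_id, 0.0])
        else:
            worst = min(range(len(arms)), key=lambda i: arms[i][1])
            arms[worst] = [segment_id, 0.0]

# Example flow: index one segment, then handle a similar incoming segment.
idx = FeatureIndex()
seg_a = [fingerprint(bytes([i])) for i in range(8)]
idx.insert(segment_feature(seg_a), segment_id=0)

seg_b = seg_a[:6] + [fingerprint(b"new1"), fingerprint(b"new2")]
choice = idx.select(segment_feature(seg_b))
if choice is not None:
    seg_id, _ = choice
    # Prefetch seg_id's fingerprints (and successors) into cache (omitted),
    # count duplicate hits, then feed the hit ratio back as the reward.
    idx.reward(segment_feature(seg_b), seg_id, hits=6, total=8)
```

In this reading, each feature's segment list acts as the arms of a bandit: segments that repeatedly yield many duplicate hits accumulate higher scores and are prefetched first, which is how the index stays small while adapting to the backup stream.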