{"title":"Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage","authors":"Xincheng Yuan, M. Moh, Teng-Sheng Moh","doi":"10.1109/ASONAM55673.2022.10068661","DOIUrl":null,"url":null,"abstract":"Deduplication is the process of removing replicated data content from storage facilities like online databases, cloud datastore, local file systems, etc. It is commonly performed as part of data preprocessing to eliminate redundant data that requires extra storage spaces and computing power and is crucial for data storage management in cloud computing. Deduplication is essential for file backup systems since duplicated files will presumably consume more storage space, especially with a short backup period such as daily. A common technique in this field involves splitting files into chunks whose hashes can be compared using data structures or techniques like clustering. This paper explores the possibility of performing such file chunk deduplication leveraging an innovative reinforcement learning approach to achieve a high deduplication ratio. The proposed system is named SegDup, which achieves 13% higher deduplication ratio than Extreme Binning, a state-of-the art deduplication algorithm.","PeriodicalId":423113,"journal":{"name":"2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASONAM55673.2022.10068661","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Deduplication is the process of removing replicated data content from storage facilities such as online databases, cloud datastores, and local file systems. It is commonly performed during data preprocessing to eliminate redundant data that would otherwise consume extra storage space and computing power, and it is crucial for data storage management in cloud computing. Deduplication is especially important for file backup systems, where duplicated files accumulate quickly when the backup interval is short (e.g., daily). A common technique in this field splits files into chunks whose hashes can be compared using data structures or techniques such as clustering. This paper explores performing such file-chunk deduplication with an innovative reinforcement learning approach to achieve a high deduplication ratio. The proposed system, named SegDup, achieves a 13% higher deduplication ratio than Extreme Binning, a state-of-the-art deduplication algorithm.
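As a rough illustration of the chunk-and-hash idea the abstract refers to (not the paper's SegDup or Extreme Binning algorithms), the Python sketch below splits files into fixed-size chunks, hashes each chunk, and stores a chunk only if its hash has not been seen before. The fixed chunk size, SHA-256 hashing, and in-memory store are all assumptions for illustration; production systems typically use content-defined chunking and on-disk indexes.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 8 * 1024  # assumed fixed chunk size; real systems often use content-defined chunking


class ChunkStore:
    """Toy chunk-based deduplication store keyed by chunk hash (illustrative only)."""

    def __init__(self) -> None:
        self.chunks = {}  # chunk hash -> chunk bytes, stored once
        self.files = {}   # file path -> ordered list of chunk hashes (the file "recipe")

    def add_file(self, path: Path) -> int:
        """Ingest a file, storing only previously unseen chunks; returns logical bytes read."""
        hashes, logical = [], 0
        with path.open("rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                logical += len(chunk)
                digest = hashlib.sha256(chunk).hexdigest()
                # Duplicate chunks are detected by hash and not stored again.
                self.chunks.setdefault(digest, chunk)
                hashes.append(digest)
        self.files[str(path)] = hashes
        return logical

    def dedup_ratio(self, total_logical_bytes: int) -> float:
        """Logical bytes ingested divided by unique bytes actually stored."""
        stored = sum(len(c) for c in self.chunks.values())
        return total_logical_bytes / stored if stored else 1.0
```

A backup run would call `add_file` for each file and accumulate the returned logical sizes; the ratio of logical to stored bytes is one common way to report a deduplication ratio, though the paper may define its metric differently.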