{"title":"Deduplication using Modified Dynamic File Chunking for Big Data Mining","authors":"Saja Taha Ahmed","doi":"10.12785/ijcds/160105","DOIUrl":null,"url":null,"abstract":": The unpredictability of data growth necessitates data management to make optimum use of storage capacity. An innovative strategy for data deduplication is suggested in this study. The file is split into blocks of a predefined size by the predefined-size DeDuplication algorithm. The primary problem with this strategy is that the preceding sections will be relocated from their original placements if additional sections are inserted into the forefront or center of a file. As a result, the generated chunks will have a new hash value, resulting in a lower DeDuplication ratio. To overcome this drawback, this study suggests multiple characters as content-defined chunking breakpoints, which mostly depend on file internal representation and have variable chunk sizes. The experimental result shows significant improvement in the redundancy removal ratio of the Linux dataset. So, a comparison is made between the proposed fixed and dynamic deduplication stating that dynamic chunking has less average chunk size and can gain a much higher deduplication ratio.","PeriodicalId":37180,"journal":{"name":"International Journal of Computing and Digital Systems","volume":"91 2","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computing and Digital Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.12785/ijcds/160105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The unpredictability of data growth makes careful data management essential to optimal use of storage capacity. This study proposes an innovative strategy for data deduplication. In fixed-size deduplication, the file is split into blocks of a predefined size. The primary problem with this approach is that when new content is inserted at the beginning or middle of a file, all subsequent sections shift from their original positions; the regenerated chunks therefore produce new hash values, lowering the deduplication ratio. To overcome this drawback, this study proposes using multiple characters as content-defined chunking breakpoints, so that chunk boundaries depend on the file's internal representation and chunk sizes vary. Experimental results show a significant improvement in the redundancy-removal ratio on the Linux dataset. A comparison between the proposed fixed and dynamic deduplication schemes shows that dynamic chunking yields a smaller average chunk size and achieves a much higher deduplication ratio.
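To make the contrast concrete, the following Python sketch illustrates the two chunking strategies the abstract compares: fixed-size chunking, whose boundaries shift after an insertion, and content-defined chunking, whose boundaries follow breakpoint characters in the data. The breakpoint set, chunk sizes, and SHA-1 fingerprinting are illustrative assumptions, not the paper's exact parameters.

```python
import hashlib

# Illustrative sketch only: breakpoint characters, sizes, and the SHA-1
# fingerprint are assumptions, not the parameters used in the paper.

def fixed_size_chunks(data: bytes, size: int = 4096):
    """Split data into fixed-size blocks; an insertion near the start of the
    file shifts every later boundary, so downstream chunks hash differently."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data: bytes, breakpoints: bytes = b"\n;}", max_size: int = 8192):
    """Split data wherever a breakpoint character occurs (or at max_size),
    so boundaries follow the file content and survive insertions elsewhere."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte in breakpoints or (i - start + 1) >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_ratio(chunks):
    """Deduplication ratio = total chunks / unique chunks (by fingerprint)."""
    unique = {hashlib.sha1(c).hexdigest() for c in chunks}
    return len(chunks) / max(len(unique), 1)

if __name__ == "__main__":
    original = b"line of text;\n" * 1000
    modified = b"inserted prefix;\n" + original  # insertion at the front of the file

    combined_fixed = fixed_size_chunks(original) + fixed_size_chunks(modified)
    combined_cdc = content_defined_chunks(original) + content_defined_chunks(modified)

    print("fixed-size dedup ratio:   ", round(dedup_ratio(combined_fixed), 2))
    print("content-defined ratio:    ", round(dedup_ratio(combined_cdc), 2))
```

Running the sketch, the fixed-size scheme finds almost no duplicate chunks between the original and the modified file, while the content-defined scheme recognizes nearly all chunks after the insertion point as duplicates, which is the effect the abstract attributes to dynamic chunking.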