Enhanced Duplicate Count Strategy: Towards New Algorithms to Improve Duplicate Detection
Authors: Y. Aassem, I. Hafidi, N. Aboutabit
Published in: Proceedings of the 3rd International Conference on Networking, Information Systems & Security
Publication date: 2020-03-31
DOI: 10.1145/3386723.3387877 (https://doi.org/10.1145/3386723.3387877)
Citations: 5
Abstract
Duplicate detection is the process of identifying multiple representations of the same real-world entity. Data today is often heterogeneous, and as a dataset grows, the number of pairwise comparisons grows rapidly, which makes the task more complex. In recent years, many approaches have been developed to reduce the number of record-pair comparisons while maintaining high matching quality. Two well-known families are the Sorted Neighborhood Method (SNM) and Blocking algorithms. Inspired by both, we propose an Enhanced Duplicate Count Strategy, a new hybrid approach that creates iterative blocks using windows of dynamic size. It is based on comparing the next element with the last duplicate found in the current window. Consequently, comparisons are saved and the similarity distance is minimized, which can lead to higher matching quality.
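The windowing idea in the abstract can be illustrated with a minimal Python sketch. This is one reading of the abstract's single-sentence description, not the authors' published algorithm. The function names, the use of the record itself as the sort key, the `difflib` matcher, the 0.8 threshold, and the rule that the window extends `window` positions past each new duplicate are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical matcher: standard-library string similarity ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def dynamic_window_duplicates(records, window=3, threshold=0.8):
    """Group sorted records into duplicate clusters; return (clusters, comparisons).

    Assumed behavior: each unassigned record opens a window over the next
    `window - 1` records. Every candidate is compared with the LAST duplicate
    found so far (not with the record that opened the window), and each new
    duplicate pushes the window end forward.
    """
    ordered = sorted(records)  # assumption: the record is its own sort key
    n = len(ordered)
    assigned = [False] * n
    clusters = []
    comparisons = 0

    for i in range(n):
        if assigned[i]:
            continue  # already placed in an earlier cluster
        assigned[i] = True
        cluster = [ordered[i]]
        anchor = i                  # index of the last duplicate found
        limit = min(i + window, n)  # exclusive end of the current window
        j = i + 1
        while j < limit:
            if not assigned[j]:
                comparisons += 1
                if similar(ordered[anchor], ordered[j], threshold):
                    assigned[j] = True
                    cluster.append(ordered[j])
                    anchor = j                  # compare later records with this one
                    limit = min(j + window, n)  # window grows dynamically
            j += 1
        clusters.append(cluster)

    return clusters, comparisons


if __name__ == "__main__":
    names = ["john smith", "jon smith", "john smyth",
             "mary jones", "marie jones", "zoe chan"]
    found, cost = dynamic_window_duplicates(names, window=3)
    print(found, cost)
```

Comparing against the most recent duplicate keeps each comparison between neighbors in the sort order, which is one way to read the abstract's claim that similarity distance is minimized. The returned comparison counter can be checked against the n(n-1)/2 comparisons of an exhaustive pairwise scan.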