Hamid Naceur Benkhaled, Djamel Berrabah, F. Boufarès
Block Sizes Control For an Efficient Real Time Record Linkage
2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech), published 2020-11-24
DOI: 10.1109/CloudTech49835.2020.9365866
Abstract
Record Linkage (RL) is the process of detecting duplicate records within a single dataset or across several datasets. Blocking is the most important phase of the RL process: it reduces the quadratic complexity of RL by dividing the data into blocks and matching records only within each block. Several blocking techniques have been proposed in the literature, but most of them provide no mechanism for controlling the sizes of the generated blocks, which is an important requirement in real-time RL and privacy-preserving RL. In this paper, we propose a mechanism to control the block sizes generated by K-Modes based Record Linkage. Experiments on three real-world datasets show satisfactory results: most of the duplicate records were detected while respecting the specified block sizes.
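To make the complexity argument concrete, the following minimal sketch counts the candidate pairs with and without blocking, and then with a hard cap on block size. It is not the K-Modes based mechanism proposed in the paper: the toy records, the first-letter blocking key, and the helper names (`pair_count`, `block_by_key`, `cap_block_sizes`) are all assumptions made for illustration, and the capping step is a naive chunking of oversized blocks.

```python
from collections import defaultdict


def pair_count(n):
    """Number of unordered record pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


def block_by_key(records, key_fn):
    """Group records that share the same blocking key (hypothetical helper)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return list(blocks.values())


def cap_block_sizes(blocks, max_size):
    """Naively split any block larger than max_size into consecutive chunks.

    This only illustrates the size constraint; it is NOT the paper's method.
    """
    capped = []
    for block in blocks:
        for start in range(0, len(block), max_size):
            capped.append(block[start:start + max_size])
    return capped


# Toy data (invented for this example).
records = [
    "smith john", "smith jon", "smyth john", "sanders amy",
    "doe jane", "doe jan", "dole jane", "brown bob",
]

# Assumed blocking key: the first character of the record.
blocks = block_by_key(records, key_fn=lambda r: r[0])
capped = cap_block_sizes(blocks, max_size=3)

all_pairs = pair_count(len(records))                      # no blocking
blocked_pairs = sum(pair_count(len(b)) for b in blocks)   # within blocks only
capped_pairs = sum(pair_count(len(b)) for b in capped)    # blocks of size <= 3

print(all_pairs, blocked_pairs, capped_pairs)  # prints "28 9 6"
```

On this toy input, blocking reduces the comparisons from 28 to 9, and capping the block size at 3 bounds the work per block at 3 comparisons. The naive chunking can place true duplicates in different chunks, which is the kind of loss a size-control mechanism has to limit.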