A Progressive Method for Detecting Duplication Entities Based on Bloom Filters
Yebing Luo, Tiezheng Nie, Derong Shen, Yue Kou, Ge Yu
2017 14th Web Information Systems and Applications Conference (WISA), published 2017-11-01
DOI: 10.1109/WISA.2017.11 (https://doi.org/10.1109/WISA.2017.11)
Citation count: 0
Abstract
As data volumes grow rapidly, the cost of detecting duplicate entities during data cleaning has increased significantly. However, some real-time applications only need to identify as many duplicate entities as possible within a limited time, rather than all of them. Existing work uses sorting to divide similar records into blocks and then arranges the processing order of the blocks so that duplicate entities are detected progressively. This approach works well only when the attributes of the records are suitable for sorting. This paper therefore proposes a novel progressive de-duplication method for records that cannot be sorted by their attributes. The method distributes records into blocks based on their features and generates a modified Bloom filter index for each block. It then uses the Bloom filter to predict the probability that a block contains duplicate entities, and that probability determines the order in which blocks are processed, so that duplicates are detected more quickly. Comprehensive experiments show that, within a limited time, this algorithm detects duplicates far more efficiently than the other algorithms considered in the related work.
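The pipeline described in the abstract (block the records, index each block with a Bloom filter, score each block, process the highest-scoring blocks first) can be illustrated with a minimal sketch. The abstract does not specify the paper's modified Bloom filter, its blocking features, or its probability estimate, so everything below is an assumption made for illustration: a plain Bloom filter, a hypothetical `block_key` that uses the first character of the record, a hypothetical `normalize` step, and a score equal to the fraction of records in a block that the filter reported as already present when they were inserted.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Plain Bloom filter over strings (assumption: not the paper's modified variant)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Insert item; return True if every bit was already set (possible repeat)."""
        seen = True
        for pos in self._positions(item):
            if not self.bits[pos]:
                seen = False
                self.bits[pos] = 1
        return seen


def normalize(record):
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(record.lower().split())


def block_key(record):
    # Hypothetical blocking feature: first character of the normalized record.
    return normalize(record)[:1]


def prioritize_blocks(records):
    """Return (key, score, members) tuples, highest duplicate score first."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)

    ranked = []
    for key, members in blocks.items():
        bloom = BloomFilter()
        # Count insertions that the filter reports as already present.
        hits = sum(bloom.add(normalize(member)) for member in members)
        ranked.append((key, hits / len(members), members))

    # Blocks most likely to contain duplicates are processed first.
    ranked.sort(key=lambda entry: entry[1], reverse=True)
    return ranked


if __name__ == "__main__":
    sample = ["Alice Smith", "alice  smith", "Bob Jones",
              "Carol White", "ALICE SMITH", "Bob  Jones"]
    for key, score, members in prioritize_blocks(sample):
        print(key, round(score, 2), members)
```

In this sketch a block's score can be inflated by Bloom filter false positives, which is acceptable for ordering purposes because the score only decides which block is compared first; an exact pairwise comparison inside each block would still be needed to confirm duplicates.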