Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves
{"title":"Tuning large scale deduplication with reduced effort","authors":"Guilherme Dal Bianco, R. Galante, C. Heuser, Marcos André Gonçalves","doi":"10.1145/2484838.2484873","DOIUrl":null,"url":null,"abstract":"Deduplication is the task of identifying which objects are potentially the same in a data repository. It usually demands user intervention in several steps of the process, mainly to identify some pairs representing matchings and non-matchings. This information is then used to help in identifying other potentially duplicated records. When deduplication is applied to very large datasets, the performance and matching quality depends on expert users to configure the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework called FS-Dedup able to help tuning the deduplication process on large datasets with a reduced effort from the user, who is only required to label a small, automatically selected, subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability in large datasets but requires an expert user to tune several parameters. FS-Dedup helps in solving this drawback by providing a framework that does not demand specialized user knowledge about the dataset or thresholds to produce high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with a reduced manual effort from the user.","PeriodicalId":74773,"journal":{"name":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","volume":"32 1","pages":"18:1-18:12"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484838.2484873","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9
Abstract
Deduplication is the task of identifying which objects are potentially the same in a data repository. It usually demands user intervention in several steps of the process, mainly to identify some pairs representing matchings and non-matchings. This information is then used to help in identifying other potentially duplicated records. When deduplication is applied to very large datasets, the performance and matching quality depends on expert users to configure the most important steps of the process (e.g., blocking and classification). In this paper, we propose a new framework called FS-Dedup able to help tuning the deduplication process on large datasets with a reduced effort from the user, who is only required to label a small, automatically selected, subset of pairs. FS-Dedup exploits Signature-Based Deduplication (Sig-Dedup) algorithms in its deduplication core. Sig-Dedup is characterized by high efficiency and scalability in large datasets but requires an expert user to tune several parameters. FS-Dedup helps in solving this drawback by providing a framework that does not demand specialized user knowledge about the dataset or thresholds to produce high effectiveness. Our evaluation over large real and synthetic datasets (containing millions of records) shows that FS-Dedup is able to reach or even surpass the maximal matching quality obtained by Sig-Dedup techniques with a reduced manual effort from the user.