C. R. Haruna, Mengshu Hou, M. J. Eghan, Michael Y. Kpiebaareh, Lawrence Tandoh, Barbie Eghan-Yartel, Maame G. Asante-Mensah
2018 5th International Conference on Systems and Informatics (ICSAI), November 2018. DOI: 10.1109/ICSAI.2018.8599375
Cost-Based and Effective Human-Machine Based Data Deduplication Model in Entity Reconciliation
In the real world, databases often contain several records that represent the same entity, and these duplicates share no common key, which makes deduplication difficult. Machine-based and crowdsourcing techniques have been used disjointly to improve the quality of data deduplication, with crowdsourcing applied to tasks that machine-based algorithms handle poorly. Although the crowd, compared with machines, provides relatively more accurate results, both platforms were slow in execution and hence expensive to implement. In this paper, a hybrid human-machine system is proposed in which machines are first applied to the data set, and humans are then used to identify the remaining potential duplicates. We performed experiments using three benchmark datasets: paper, restaurant, and product. Our algorithm was compared with several existing techniques and outperformed some of them, achieving high deduplication accuracy and good deduplication efficiency while incurring low crowdsourcing costs.
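The machine-first, crowd-second pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the Jaccard similarity metric, the two thresholds, and the `ask_crowd` oracle are all assumptions standing in for the paper's actual machine pass and crowdsourcing step.

```python
# Hedged sketch of a hybrid human-machine deduplication pass.
# Assumptions (not from the paper): token-set Jaccard similarity,
# thresholds hi/lo, and an ask_crowd callback standing in for a
# real crowdsourcing platform.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two record strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def hybrid_dedup(records, ask_crowd, hi=0.9, lo=0.3):
    """Machine pass first: auto-accept pairs scoring >= hi and
    auto-reject pairs scoring <= lo; only the uncertain middle
    band is forwarded to the crowd, which is what keeps the
    crowdsourcing cost low."""
    matches, crowd_cost = [], 0
    for i, j in combinations(range(len(records)), 2):
        s = jaccard(records[i], records[j])
        if s >= hi:
            matches.append((i, j))      # machine-confirmed duplicate
        elif s > lo:
            crowd_cost += 1             # uncertain: pay for one crowd vote
            if ask_crowd(records[i], records[j]):
                matches.append((i, j))
        # s <= lo: machine-confirmed non-duplicate, no cost incurred
    return matches, crowd_cost

# Usage with a stand-in "crowd" that always answers yes; in practice
# a human worker would judge each uncertain pair.
recs = ["ICSAI 2018 Systems and Informatics",
        "2018 ICSAI Systems and Informatics Conf",
        "Journal of Chemistry"]
pairs, cost = hybrid_dedup(recs, lambda a, b: True)
```

Only the near-duplicate pair falls into the uncertain band and triggers a crowd query; the clearly dissimilar pairs are resolved by the machine pass alone, so the crowd cost scales with the size of the ambiguous band rather than with the full number of record pairs.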