C. R. Haruna, Mengshu Hou, M. J. Eghan, Michael Y. Kpiebaareh, Lawrence Tandoh, Barbie Eghan-Yartel, Maame G. Asante-Mensah
2018 5th International Conference on Systems and Informatics (ICSAI), November 2018. DOI: 10.1109/ICSAI.2018.8599375
Cost-Based and Effective Human-Machine Based Data Deduplication Model in Entity Reconciliation
In the real world, databases often contain several records that represent the same entity, and these duplicates share no common key, which makes deduplication difficult. Machine-based and crowdsourcing techniques have been used disjointly to improve the quality of data deduplication, with crowdsourcing applied to tasks that machine-based algorithms handle poorly. Although the crowd, compared with machines, provides relatively more accurate results, both platforms were slow in execution and hence expensive to implement. In this paper, a hybrid human-machine system is proposed in which machines are first applied to the data set, and humans are then used to identify the remaining potential duplicates. We performed experiments using three benchmark datasets: paper, restaurant, and product. Our algorithm was compared with several existing techniques and outperformed some of them, achieving high deduplication accuracy and good deduplication efficiency while incurring low crowdsourcing costs.
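The machine-first, crowd-second pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm: the Jaccard similarity metric, the two thresholds, and the `ask_crowd` oracle are all assumptions standing in for the paper's actual machine pass and crowdsourcing step.

```python
# Hedged sketch of a hybrid human-machine deduplication pass.
# Assumptions (not from the paper): token-set Jaccard similarity,
# thresholds hi/lo, and an ask_crowd callback standing in for a
# real crowdsourcing platform.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two record strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def hybrid_dedup(records, ask_crowd, hi=0.9, lo=0.3):
    """Machine pass first: auto-accept pairs scoring >= hi and
    auto-reject pairs scoring <= lo; only the uncertain middle
    band is forwarded to the crowd, which is what keeps the
    crowdsourcing cost low."""
    matches, crowd_cost = [], 0
    for i, j in combinations(range(len(records)), 2):
        s = jaccard(records[i], records[j])
        if s >= hi:
            matches.append((i, j))      # machine-confirmed duplicate
        elif s > lo:
            crowd_cost += 1             # uncertain: pay for one crowd vote
            if ask_crowd(records[i], records[j]):
                matches.append((i, j))
        # s <= lo: machine-confirmed non-duplicate, no cost incurred
    return matches, crowd_cost

# Usage with a stand-in "crowd" that always answers yes; in practice
# a human worker would judge each uncertain pair.
recs = ["ICSAI 2018 Systems and Informatics",
        "2018 ICSAI Systems and Informatics Conf",
        "Journal of Chemistry"]
pairs, cost = hybrid_dedup(recs, lambda a, b: True)
```

Only the near-duplicate pair falls into the uncertain band and triggers a crowd query; the clearly dissimilar pairs are resolved by the machine pass alone, so the crowd cost scales with the size of the ambiguous band rather than with the full number of record pairs.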