TAILOR: a record linkage toolbox

Proceedings 18th International Conference on Data Engineering. Pub Date: 2002-08-07. DOI: 10.1109/ICDE.2002.994694

Mohamed G. Elfeky, A. Elmagarmid, Vassilios S. Verykios


Citations: 337

Abstract

Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
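To make the record linkage task described above concrete, here is a minimal sketch of the generic idea: compare record pairs field by field, combine the similarities into a score, and label pairs above a threshold as matches. This is not one of the three models proposed in the paper and not TAILOR's interface. The function names, the `difflib`-based string similarity, the simple averaging rule, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (illustrative choice)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def comparison_vector(rec_a: dict, rec_b: dict, fields: list) -> list:
    """One similarity value per compared field for a record pair."""
    return [field_similarity(rec_a[f], rec_b[f]) for f in fields]


def link_records(records: list, fields: list, threshold: float = 0.85) -> list:
    """Return (i, j, score) for every pair whose mean similarity meets the threshold.

    Comparing all pairs is quadratic in the number of records; it is used here
    only to keep the sketch short.
    """
    matches = []
    for (i, rec_a), (j, rec_b) in combinations(enumerate(records), 2):
        vector = comparison_vector(rec_a, rec_b, fields)
        score = sum(vector) / len(vector)  # hypothetical decision rule
        if score >= threshold:
            matches.append((i, j, score))
    return matches


# Hypothetical toy data: the first two records describe the same person.
records = [
    {"name": "John Smith", "city": "Chicago"},
    {"name": "Jon Smith", "city": "Chicago"},
    {"name": "Maria Lopez", "city": "Austin"},
]
matches = link_records(records, ["name", "city"])
for i, j, score in matches:
    print(f"records {i} and {j} look like duplicates (score {score:.2f})")
```

A fixed threshold on an averaged score is the simplest possible decision rule. The abstract's point is that no single model has been shown to be superior, which is why a toolbox that lets users swap comparison functions and decision models is useful.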


