TAILOR: a record linkage toolbox

Mohamed G. Elfeky, A. Elmagarmid, Vassilios S. Verykios
{"title":"TAILOR: a record linkage toolbox","authors":"Mohamed G. Elfeky, A. Elmagarmid, Vassilios S. Verykios","doi":"10.1109/ICDE.2002.994694","DOIUrl":null,"url":null,"abstract":"Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for \"RecOrd LInkAge Toolbox\"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.","PeriodicalId":191529,"journal":{"name":"Proceedings 18th International Conference on Data Engineering","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"337","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 18th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2002.994694","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 337

Abstract

Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. In this paper, we address the record linkage problem by adopting a machine learning approach. Three models are proposed and are analyzed empirically. Since no existing model, including those proposed in this paper, has been proved to be superior, we have developed an interactive record linkage toolbox named TAILOR (backwards acronym for "RecOrd LInkAge Toolbox"). Users of TAILOR can build their own record linkage models by tuning system parameters and by plugging in in-house-developed and public-domain tools. The proposed toolbox serves as a framework for the record linkage process, and is designed in an extensible way to interface with existing and future record linkage models. We have conducted an extensive experimental study to evaluate our proposed models using not only synthetic but also real data. The results show that the proposed machine-learning record linkage models outperform the existing ones both in accuracy and in performance.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
一个记录链接工具箱
数据清理是确保实际数据库中存储的数据质量的重要过程。在数据库的知识发现、数据仓库、系统集成和电子服务等研究领域,经常遇到数据清理问题。识别表示同一实体(重复记录)的记录对的过程,通常称为记录链接,是数据清理的基本要素之一。在本文中,我们通过采用机器学习方法来解决记录链接问题。提出了三种模型,并进行了实证分析。由于没有现有的模型,包括本文中提出的模型,被证明是优越的,我们开发了一个交互式记录链接工具箱,名为TAILOR(“记录链接工具箱”的倒写缩写)。TAILOR的用户可以通过调整系统参数和插入内部开发的和公共领域的工具来构建他们自己的记录链接模型。建议的工具箱用作记录链接过程的框架,并以可扩展的方式设计,以与现有和未来的记录链接模型进行接口。我们进行了广泛的实验研究,不仅使用合成数据,而且使用真实数据来评估我们提出的模型。结果表明,所提出的机器学习记录链接模型在准确性和性能上都优于现有的记录链接模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Out from under the trees [linear file template] Declarative composition and peer-to-peer provisioning of dynamic Web services Multivariate time series prediction via temporal classification Integrating workflow management systems with business-to-business interaction standards YFilter: efficient and scalable filtering of XML documents
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1