High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

K. O'Hare, Anna Jurek-Loughrey, Cassio P. de Campos
ACM Transactions on Knowledge Discovery from Data (TKDD), published 2021-07-21
DOI: 10.1145/3450527 (https://doi.org/10.1145/3450527)
Citations: 3

Abstract

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.
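
The abstract does not spell out how HVTB selects its tokens, so the following is only a minimal sketch of the general idea it names: schema-agnostic token blocking driven by TF-IDF. It assumes each record is flattened to a single string, that each record's k highest-scoring tokens serve as its blocking keys, and that tokens occurring in only one record are skipped because they cannot link anything. The names `tokenize`, `top_tokens`, `candidate_pairs`, the parameter `k`, and the sample records are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_tokens(records, k=2):
    """Map each record id to its k tokens with the highest TF-IDF score.

    Assumption of this sketch: tokens seen in only one record are ignored,
    since a blocking key must be shared to produce a candidate pair.
    """
    docs = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    n = len(docs)
    # Document frequency: number of records containing each token.
    df = Counter(tok for counts in docs.values() for tok in counts)
    keys = {}
    for rid, counts in docs.items():
        total = sum(counts.values())
        scores = {
            tok: (c / total) * math.log(n / df[tok])
            for tok, c in counts.items()
            if df[tok] >= 2
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keys[rid] = ranked[:k]
    return keys


def candidate_pairs(records, k=2):
    """Group records by shared high-value token and return candidate pairs."""
    blocks = defaultdict(set)
    for rid, toks in top_tokens(records, k).items():
        for tok in toks:
            blocks[tok].add(rid)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical toy data: two near-duplicate pairs and one unrelated record.
records = {
    "r1": "Apple iPhone 12 64GB black",
    "r2": "iPhone 12 Apple 64 GB (black)",
    "r3": "Samsung Galaxy S21 128GB",
    "r4": "Galaxy S21 by Samsung, 128 GB",
    "r5": "black leather phone case",
}
pairs = candidate_pairs(records)
print(sorted(pairs))
```

On this toy input, comparing every record with every other would take 10 comparisons, while the sketch proposes only the pairs (r1, r2) and (r3, r4). The common token "black" never becomes a key for r1 or r2 because its lower inverse document frequency ranks it below their top two tokens, which is the sense in which only "high-value" tokens drive the blocks here.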