Exploring Goldstein et al.’s Scalelink method of data linkage.

International Journal of Population Data Science (IF 1.6, Q3, Health Care Sciences & Services). Pub Date: 2022-08-25. DOI: 10.23889/ijpds.v7i3.2042
M. A. M. Cleaton, Josie Plachta, R. Shipsey

Abstract

Objectives

Scalelink is an innovative probabilistic data linkage method based on correspondence analysis. Unlike the popular and widely used Fellegi-Sunter algorithm, it does not assume that linkage variables are independent. It also claims to be more intuitive and computationally efficient. We aim to test this method for the first time on real-world big data.

Approach

Scalelink assigns an agreement state to each linkage variable for each candidate pair. These states are compared to determine how frequently, across all candidate pairs, any given agreement state is held at the same time as any other agreement state (this accounts for variable dependence). The results of this comparison are fed into a loss function, which is minimised subject to constraints to produce weights. Currently, the method is accessible via Goldstein et al.’s paper and R package. We are translating it into PySpark to enable testing on datasets that are too large to link without distributed computing.

Results

Initial testing of Goldstein et al.’s Scalelink method on small samples of real-world datasets shows that it performs as expected for a probabilistic linkage method, although it cannot currently deal with missingness. To test the quality of the method on real-world big data, a high-quality linked dataset of the 2021 England and Wales Census and the follow-up Census Coverage Survey will be used as a Gold Standard (GS). After developing a method that enables Scalelink to deal with missingness, we will apply Scalelink and automatic Fellegi-Sunter probabilistic linkage to this GS. We can thus establish and compare the precision and recall of both methods. We will also investigate linkage bias for particular demographics, test computational efficiency and estimate the clerical review burden for each method.

Conclusion

Goldstein et al.’s Scalelink algorithm shows promise as a high-quality, scalable, dependence-free linkage algorithm for use in any matching project. Here, for the first time, we research the method’s quality and feasibility with real-world big data. From this we will produce recommendations regarding its utility.
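
The comparison step described in the Approach can be illustrated with a minimal Python sketch that counts how often each agreement state is held together with each other agreement state across candidate pairs. This is not the authors' implementation or the Scalelink R package API: the variable names (`forename`, `surname`, `dob`), the state labels, the toy data and the helper `cooccurrence_counts` are all assumptions made for illustration, and the later loss-minimisation step that produces weights is not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: one dict per candidate pair, giving the
# agreement state of each linkage variable for that pair.
candidate_pairs = [
    {"forename": "agree", "surname": "agree", "dob": "agree"},
    {"forename": "agree", "surname": "agree", "dob": "disagree"},
    {"forename": "disagree", "surname": "disagree", "dob": "disagree"},
    {"forename": "partial", "surname": "agree", "dob": "agree"},
]


def cooccurrence_counts(pairs):
    """Count how often two (variable, state) combinations are held
    by the same candidate pair, over all candidate pairs."""
    counts = Counter()
    for states in pairs:
        # Sort so that each unordered combination has a single key.
        items = sorted(states.items())
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence_counts(candidate_pairs)

# Number of candidate pairs agreeing on both forename and surname.
both_names_agree = counts[(("forename", "agree"), ("surname", "agree"))]
print(both_names_agree)
```

Because the counts are taken jointly rather than per variable, dependence between variables (for example, forename and surname tending to agree together) is visible in the table, which is the property the abstract contrasts with the independence assumption of Fellegi-Sunter.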