Exploring Goldstein et al.’s Scalelink method of data linkage.

International Journal of Population Data Science | IF 2.2 | Q3, Health Care Sciences & Services | Pub Date: 2022-08-25 | DOI: 10.23889/ijpds.v7i3.2042

M. A. M. Cleaton, Josie Plachta, R. Shipsey



Abstract

Objectives

Scalelink is an innovative probabilistic data linkage method based on correspondence analysis. Unlike the widely used Fellegi-Sunter algorithm, it does not assume that linkage variables are independent. It also claims to be more intuitive and computationally efficient. We aim to test this method for the first time on real-world big data.

Approach

Scalelink uses agreement states for each linkage variable and candidate pair. These are compared to determine how frequently, across all candidate pairs, any given agreement state is held at the same time as any other agreement state (this accounts for variable dependence). The results of this comparison are fed into a loss function, which is minimised subject to constraints to produce weights. Currently, the method is accessible via Goldstein et al.’s paper and R package. We are translating it into PySpark to enable testing on datasets that are too large to link without distributed computing.

Results

Initial testing of Goldstein et al.’s Scalelink method on small samples of real-world datasets shows that it performs as expected for a probabilistic linkage method, although it cannot currently deal with missingness. To test the quality of the method on real-world big data, a high-quality linked dataset of the 2021 England and Wales Census and the follow-up Census Coverage Survey will be used as a Gold Standard (GS). After developing a method that enables Scalelink to deal with missingness, we will apply Scalelink and automatic Fellegi-Sunter probabilistic linkage to this GS. We can thus establish and compare the precision and recall of both methods. We will also investigate linkage bias for particular demographics, test computational efficiency and estimate the clerical review burden for each method.

Conclusion

Goldstein et al.’s Scalelink algorithm shows promise as a high-quality, scalable, dependence-free linkage algorithm for use in any matching project. Here, for the first time, we research the method’s quality and feasibility with real-world big data. From this we will produce recommendations regarding its utility.
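
As an illustration only, the following Python sketch shows two steps the abstract describes: counting how often pairs of agreement states are held together across candidate pairs, and computing precision and recall against a gold standard. It is not the Scalelink R package or the authors' PySpark translation. The function names, the two-state agree/disagree coding and the example records are all assumptions made for the sketch. The loss-function minimisation that turns these counts into weights is not reproduced, and missing values are not handled.

```python
from collections import Counter
from itertools import combinations


def agreement_states(record_a, record_b, variables):
    """Return one (variable, state) label per linkage variable for a candidate pair.

    Assumption: a simple two-state coding (agree / disagree) by exact equality.
    """
    return [
        (var, "agree" if record_a[var] == record_b[var] else "disagree")
        for var in variables
    ]


def cooccurrence_counts(candidate_pairs, variables):
    """Count, over all candidate pairs, how often two agreement states are held together.

    The result is a symmetric table keyed by (state_1, state_2); the diagonal
    entries hold the marginal frequency of each individual state.
    """
    counts = Counter()
    for record_a, record_b in candidate_pairs:
        states = agreement_states(record_a, record_b, variables)
        for state in states:
            counts[(state, state)] += 1
        for first, second in combinations(states, 2):
            counts[(first, second)] += 1
            counts[(second, first)] += 1
    return counts


def precision_recall(predicted_links, gold_links):
    """Precision and recall of predicted links against a gold-standard link set."""
    predicted, gold = set(predicted_links), set(gold_links)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical candidate pairs, invented purely for illustration.
example_variables = ["forename", "surname", "dob"]
example_pairs = [
    ({"forename": "ann", "surname": "lee", "dob": "1990"},
     {"forename": "ann", "surname": "lee", "dob": "1990"}),
    ({"forename": "ann", "surname": "lee", "dob": "1990"},
     {"forename": "anne", "surname": "lee", "dob": "1991"}),
    ({"forename": "bob", "surname": "ray", "dob": "1985"},
     {"forename": "ann", "surname": "lee", "dob": "1990"}),
]
example_counts = cooccurrence_counts(example_pairs, example_variables)
```

Because the table records joint frequencies of states across variables, it keeps the dependence information that an independence assumption would discard. On census-scale data the same counting would need a distributed implementation, which is the motivation the abstract gives for the PySpark translation.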




Source journal