Variable selection for latent class analysis in the presence of missing data with application to record linkage

IF 1.6 | Zone 3 (Medicine) | Q3 Health Care Sciences & Services | Statistical Methods in Medical Research | Pub Date: 2024-04-09 | DOI: 10.1177/09622802241242317
Huiping Xu, Xiaochun Li, Zuoyi Zhang, Shaun Grannis

Abstract

The Fellegi-Sunter model is a latent class model widely used in probabilistic linkage to identify records that belong to the same entity. Record linkage practitioners typically include all available matching fields in the model, on the premise that more fields convey more information about the true match status and hence improve match performance. In model-based clustering, this premise is well known to be incorrect: including noisy variables can compromise the clustering. Variable selection procedures have therefore been developed to remove noisy variables. Although these procedures have the potential to improve record matching, they cannot be applied directly because missing data are ubiquitous in record linkage applications. In this paper, we modify the stepwise variable selection procedure proposed by Fop, Smart, and Murphy and extend it to account for the missing data common in record linkage. In simulation studies, the proposed method selects the correct set of matching fields across various settings, leading to better-performing algorithms. The improved match performance is also seen in a real-world application. We therefore recommend the proposed selection procedure for identifying informative matching fields for probabilistic record linkage algorithms.
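To make the idea concrete, the following is a minimal sketch, not the authors' procedure. It fits a two-class Fellegi-Sunter model by EM to binary agreement patterns, skips missing comparisons (which assumes they are missing at random), and compares two models by BIC: one where every field has class-specific agreement probabilities, and one where a candidate field is "tied" to a single probability shared by both classes, i.e. treated as noise. The function names (`simulate`, `fit`), the starting values, the BIC convention (larger is better), and the simulated settings are all assumptions made for illustration; the paper's stepwise procedure may differ in its details.

```python
import math
import random

EPS = 1e-6  # keeps probabilities away from 0 and 1 (illustrative choice)

def _clip(x):
    return min(max(x, EPS), 1.0 - EPS)

def _joint(g, p, m, u):
    """Joint likelihood of pattern g with the match and non-match class."""
    lm, lu = p, 1.0 - p
    for j, x in enumerate(g):
        if x is None:  # missing comparison contributes no factor
            continue
        lm *= m[j] if x else 1.0 - m[j]
        lu *= u[j] if x else 1.0 - u[j]
    return lm, lu

def fit(patterns, tied=(), n_iter=100):
    """EM for a two-class model; fields in `tied` share one probability."""
    tied = set(tied)
    k = len(patterns[0])
    p, m, u = 0.1, [0.9] * k, [0.1] * k  # assumed starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each record pair is a match
        post = []
        for g in patterns:
            lm, lu = _joint(g, p, m, u)
            post.append(lm / (lm + lu))
        # M-step: weighted proportions over the observed comparisons only
        p = _clip(sum(post) / len(post))
        for j in range(k):
            obs = [(w, g[j]) for w, g in zip(post, patterns) if g[j] is not None]
            if not obs:
                continue
            if j in tied:
                m[j] = u[j] = _clip(sum(x for _, x in obs) / len(obs))
            else:
                wm = sum(w for w, _ in obs)
                wu = len(obs) - wm
                m[j] = _clip(sum(w * x for w, x in obs) / max(wm, EPS))
                u[j] = _clip(sum((1 - w) * x for w, x in obs) / max(wu, EPS))
    loglik = sum(math.log(sum(_joint(g, p, m, u))) for g in patterns)
    n_par = 1 + 2 * k - len(tied)  # prevalence + m and u (one value if tied)
    bic = 2 * loglik - n_par * math.log(len(patterns))
    return p, m, u, bic

def simulate(n, p, m, u, miss, seed=1):
    """Draw n agreement patterns; each entry is missing with prob. `miss`."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        data.append([None if rng.random() < miss else int(rng.random() < q)
                     for q in probs])
    return data

# Three informative fields and one noise field (same rate in both classes).
data = simulate(2000, 0.2, [0.95, 0.95, 0.95, 0.3], [0.05, 0.05, 0.05, 0.3], 0.2)
p_hat, m_hat, u_hat, bic_free = fit(data)
_, _, _, bic_noise_tied = fit(data, tied={3})
_, _, _, bic_info_tied = fit(data, tied={0})
print("estimated match prevalence:", round(p_hat, 3))
print("BIC, all fields class-specific:", round(bic_free, 1))
print("BIC, noise field tied:", round(bic_noise_tied, 1))
print("BIC, informative field tied:", round(bic_info_tied, 1))
```

Under this convention, a field is worth keeping when the class-specific model has the higher BIC. Tying a field makes it contribute the same factor to both classes, so it no longer influences the posterior match probabilities, which is the sense in which it is dropped from the linkage model.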
Source journal
Statistical Methods in Medical Research (Medicine: Mathematical & Computational Biology)
CiteScore: 4.10
Self-citation rate: 4.30%
Articles published: 127
Review time: >12 weeks
Journal description: Statistical Methods in Medical Research is a peer-reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE).