Regression with linked datasets subject to linkage error

IF 4.4 2区 数学 Q1 STATISTICS & PROBABILITY Wiley Interdisciplinary Reviews-Computational Statistics Pub Date : 2021-09-08 DOI:10.1002/wics.1570
Zhenbang Wang, E. Ben-David, G. Diao, M. Slawski
{"title":"Regression with linked datasets subject to linkage error","authors":"Zhenbang Wang, E. Ben-David, G. Diao, M. Slawski","doi":"10.1002/wics.1570","DOIUrl":null,"url":null,"abstract":"Data are often collected from multiple heterogeneous sources and are combined subsequently. In combing data, record linkage is an essential task for linking records in datasets that refer to the same entity. Record linkage is generally not error‐free; there is a possibility that records belonging to different entities are linked or that records belonging to the same entity are missed. It is not advisable to simply ignore such errors because they can lead to data contamination and introduce bias in sample selection or estimation, which, in return, can lead to misleading statistical results and conclusions. For a long while, this problem was not properly recognized, but in recent years a growing number of researchers have developed methodology for dealing with linkage errors in regression analysis with linked datasets. The main goal of this overview is to give an account of those developments, with an emphasis on recent approaches and their connection to the so‐called “Broken Sample” problem. We also provide a short empirical study that illustrates the efficacy of corrective methods in different scenarios.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2021-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Wiley Interdisciplinary Reviews-Computational Statistics","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1002/wics.1570","RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
引用次数: 9

Abstract

Data are often collected from multiple heterogeneous sources and are combined subsequently. In combing data, record linkage is an essential task for linking records in datasets that refer to the same entity. Record linkage is generally not error‐free; there is a possibility that records belonging to different entities are linked or that records belonging to the same entity are missed. It is not advisable to simply ignore such errors because they can lead to data contamination and introduce bias in sample selection or estimation, which, in return, can lead to misleading statistical results and conclusions. For a long while, this problem was not properly recognized, but in recent years a growing number of researchers have developed methodology for dealing with linkage errors in regression analysis with linked datasets. The main goal of this overview is to give an account of those developments, with an emphasis on recent approaches and their connection to the so‐called “Broken Sample” problem. We also provide a short empirical study that illustrates the efficacy of corrective methods in different scenarios.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
存在链接错误的链接数据集的回归
数据通常是从多个异构来源收集的,然后进行组合。在梳理数据时,记录链接是链接引用同一实体的数据集中的记录的一项重要任务。记录链接通常不是无错误的;存在属于不同实体的记录被链接或者属于同一实体的记录丢失的可能性。简单地忽略这些错误是不可取的,因为它们可能导致数据污染,并在样本选择或估计中引入偏差,反过来,这可能导致误导性的统计结果和结论。很长一段时间以来,这个问题没有得到正确的认识,但近年来,越来越多的研究人员开发了处理关联数据集回归分析中的关联误差的方法。本概述的主要目标是介绍这些发展,重点介绍最近的方法及其与所谓的“破碎样本”问题的联系。我们还提供了一项简短的实证研究,说明了纠正方法在不同情况下的疗效。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
CiteScore
6.20
自引率
0.00%
发文量
31
期刊最新文献
Neuroimaging statistical approaches for determining neural correlates of Alzheimer's disease via positron emission tomography imaging. A spectrum of explainable and interpretable machine learning approaches for genomic studies Functional neuroimaging in the era of Big Data and Open Science: A modern overview Information criteria for model selection Data Integration in Causal Inference.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1