Regression with linked datasets subject to linkage error

IF 5.4 · Zone 2 (Mathematics) · Q1 STATISTICS & PROBABILITY · Wiley Interdisciplinary Reviews-Computational Statistics · Pub Date: 2021-09-08 · DOI: 10.1002/wics.1570

Zhenbang Wang, E. Ben-David, G. Diao, M. Slawski


Citations: 9

Abstract

Data are often collected from multiple heterogeneous sources and are combined subsequently. In combining data, record linkage is an essential task for linking records in datasets that refer to the same entity. Record linkage is generally not error‐free; there is a possibility that records belonging to different entities are linked or that records belonging to the same entity are missed. It is not advisable to simply ignore such errors because they can lead to data contamination and introduce bias in sample selection or estimation, which, in turn, can lead to misleading statistical results and conclusions. For a long while, this problem was not properly recognized, but in recent years a growing number of researchers have developed methodology for dealing with linkage errors in regression analysis with linked datasets. The main goal of this overview is to give an account of those developments, with an emphasis on recent approaches and their connection to the so‐called “Broken Sample” problem. We also provide a short empirical study that illustrates the efficacy of corrective methods in different scenarios.
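The estimation bias mentioned in the abstract can be illustrated with a small simulation. This is a minimal sketch and is not taken from the paper: it assumes a simple linear model, and it models linkage error by shuffling the responses of a randomly chosen fraction of records among themselves. The function name `simulate_mismatch_bias` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def simulate_mismatch_bias(n=5000, beta=2.0, mismatch_rate=0.3, seed=0):
    """Return (slope on correctly linked data, slope after random mismatches)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(scale=0.5, size=n)

    # Assumed linkage-error model: a random subset of responses is shuffled
    # among themselves, so most of those y-values end up paired with the wrong x.
    k = int(mismatch_rate * n)
    idx = rng.choice(n, size=k, replace=False)
    y_linked = y.copy()
    y_linked[idx] = y[rng.permutation(idx)]

    def ols_slope(u, v):
        # Slope of the least-squares line of v on u (intercept included).
        return np.polyfit(u, v, deg=1)[0]

    return ols_slope(x, y), ols_slope(x, y_linked)

slope_clean, slope_naive = simulate_mismatch_bias()
print(f"correct links: {slope_clean:.2f}, with mismatches: {slope_naive:.2f}")
```

Under this assumed error model, the naive slope is pulled toward zero by roughly the factor (1 − mismatch rate), because a mismatched response carries no information about the covariate it has been paired with.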




Source journal

Wiley Interdisciplinary Reviews-Computational Statistics STATISTICS & PROBABILITY-

CiteScore

6.20

Self-citation rate

0.00%

Publication count