Linking individuals across historical sources: A fully automated approach*

Historical Methods: A Journal of Quantitative and Interdisciplinary History. Publication date: 2018-02-01. DOI: 10.1080/01615440.2018.1543034

Ran Abramitzky, R. Mill, Santiago Pérez

Citations: 61

Abstract

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we suggest a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I (false positive) and type II (false negative) errors. The first step guides researchers in the choice of which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that each pair of records corresponds to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates using our method are remarkably similar to those obtained with the IPUMS approach, which relies on hand linking to create a training sample. We created R code and a Stata command that implement this method.
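To make the second and third steps concrete, the following is a minimal sketch of EM-based match probabilities. It is not the authors' R or Stata implementation. It assumes a simplified two-class mixture in which each candidate pair is summarized by binary agreement indicators (first name, last name, age) that are independent within each class. The function names, starting values, simulated data, and the 0.9 cutoff are all hypothetical and chosen only for illustration.

```python
import random


def em_match_probabilities(patterns, n_iter=500, tol=1e-10):
    """Fit a two-class (match / non-match) mixture by EM.

    patterns: list of tuples of 0/1 agreement indicators, one per candidate pair.
    Returns (posterior match probability per pair, match share, m, u).
    """
    n, k = len(patterns), len(patterns[0])
    eps = 1e-6                 # keeps probabilities away from 0 and 1
    share = 0.1                # assumed starting share of true matches
    m = [0.9] * k              # P(field agrees | match), assumed start
    u = [0.1] * k              # P(field agrees | non-match), assumed start
    post = [0.0] * n
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        for i, g in enumerate(patterns):
            like_m, like_u = share, 1.0 - share
            for j, agrees in enumerate(g):
                like_m *= m[j] if agrees else 1.0 - m[j]
                like_u *= u[j] if agrees else 1.0 - u[j]
            post[i] = like_m / (like_m + like_u)
        # M-step: update parameters using the posteriors as weights.
        w = sum(post)
        for j in range(k):
            mj = sum(post[i] * patterns[i][j] for i in range(n)) / w
            uj = sum((1.0 - post[i]) * patterns[i][j] for i in range(n)) / (n - w)
            m[j] = min(max(mj, eps), 1.0 - eps)
            u[j] = min(max(uj, eps), 1.0 - eps)
        new_share = min(max(w / n, eps), 1.0 - eps)
        converged = abs(new_share - share) < tol
        share = new_share
        if converged:
            break
    return post, share, m, u


def simulate_patterns(n_pairs=2000, seed=0):
    """Draw synthetic agreement patterns (illustrative parameters only)."""
    rng = random.Random(seed)
    true_m = (0.95, 0.90, 0.85)   # agreement rates among true matches
    true_u = (0.05, 0.10, 0.20)   # agreement rates among non-matches
    patterns = []
    for _ in range(n_pairs):
        probs = true_m if rng.random() < 0.10 else true_u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
    return patterns


patterns = simulate_patterns()
posterior, share, m, u = em_match_probabilities(patterns)
by_pattern = dict(zip(patterns, posterior))
for g in sorted(by_pattern):
    print(g, round(by_pattern[g], 3))

# Third step, sketched: keep only pairs above a hypothetical cutoff.
selected = [i for i, q in enumerate(posterior) if q >= 0.9]
print(len(selected), "pairs kept out of", len(patterns))
```

Raising the cutoff trades fewer false positives for more false negatives, which is the type I versus type II frontier the abstract describes. A fuller selection rule might also compare each record's best candidate against its runner-up.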
