Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling

Beatrix Haddock, Alix Pletcher, Nathaniel Blair-Stahn, O. Keyes, Matt Kappel, Steve Bachmeier, Syl Lutze, James Albright, Alison Bowman, Caroline Kinuthia, Zeb Burke-Conte, Rajan Mudambi, Abraham Flaxman
Gates Open Research, volume 8. Published 2024-05-03. DOI: 10.12688/gatesopenres.15418.1 (https://doi.org/10.12688/gatesopenres.15418.1)

Abstract

Background: Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science, and a common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing these authentic data have slowed work on developing, testing, and adopting new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets approaching the scale and complexity of the data on which large organizations and federal agencies, such as the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.

Methods: We created the simulated population data available through pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.

Results: Our simulation process produced over 900 gigabytes of simulated censuses, surveys, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional affordances to add various kinds of noise to the data to provide realistic, sharable challenges for ER researchers.
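
To make the idea of noisy records and fuzzy matching concrete, the following is a minimal sketch using only the Python standard library. It does not use the pseudopeople API or the Vivarium simulation. The toy population, the `add_typo` noise function, the `link` helper, and the 0.8 similarity threshold are all hypothetical choices made for illustration.

```python
import random
from difflib import SequenceMatcher

# Hypothetical toy "simulants"; this is illustrative data, not pseudopeople output.
POPULATION = [
    {"id": 1, "name": "Maria Gonzalez", "dob": "1984-03-12"},
    {"id": 2, "name": "James Whitfield", "dob": "1990-11-02"},
    {"id": 3, "name": "Aisha Rahman", "dob": "1975-07-30"},
]


def add_typo(text, rng):
    """Swap two adjacent characters to mimic a data-entry error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def observe(population, rng):
    """Build a noisy 'survey' copy, keeping the true id as ground truth."""
    return [
        {"true_id": p["id"], "name": add_typo(p["name"], rng), "dob": p["dob"]}
        for p in population
    ]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(record, population, threshold=0.8):
    """Return the id of the best fuzzy match on name with equal dob, else None."""
    best = max(population, key=lambda p: similarity(record["name"], p["name"]))
    score = similarity(record["name"], best["name"])
    if score >= threshold and record["dob"] == best["dob"]:
        return best["id"]
    return None


rng = random.Random(0)
survey = observe(POPULATION, rng)

# Exact string comparison fails on every noisy name...
exact_hits = sum(r["name"] == p["name"] for r, p in zip(survey, POPULATION))
# ...while fuzzy linkage can be scored against the known ground truth.
links = [(r["true_id"], link(r, POPULATION)) for r in survey]
accuracy = sum(truth == guess for truth, guess in links) / len(links)
print(f"exact matches: {exact_hits}, fuzzy linkage accuracy: {accuracy:.2f}")
```

Because the simulated records carry a ground-truth identifier, linkage accuracy can be measured directly, which is the kind of evaluation that restricted real-world data makes difficult to share.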