Beatrix Haddock, Alix Pletcher, Nathaniel Blair-Stahn, O. Keyes, Matt Kappel, Steve Bachmeier, Syl Lutze, James Albright, Alison Bowman, Caroline Kinuthia, Zeb Burke-Conte, Rajan Mudambi, Abraham Flaxman
Gates Open Research, volume 8. Journal article, published 2024-05-03. DOI: https://doi.org/10.12688/gatesopenres.15418.1
Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling
Background: Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science, and a common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing such authentic data have slowed the development, testing, and adoption of new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets approaching the scale and complexity of the data on which large organizations and federal agencies, such as the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.

Methods: We created the simulated population data available through pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.

Results: Our simulation process produced over 900 gigabytes of simulated census, survey, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional features for adding various kinds of noise to the data, providing realistic, sharable challenges for ER researchers.
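To make the ER task described in the abstract concrete, the following is a minimal toy sketch using only Python's standard library. It does not use the pseudopeople API and is not the authors' method: the records, the `add_typo` noise function, and the `similarity` and `link` helpers are all hypothetical, chosen only to illustrate how noise in a name field turns record linkage into a fuzzy-matching problem over name, date of birth, and address.

```python
import random
from difflib import SequenceMatcher

# Hypothetical toy records; these are NOT drawn from pseudopeople data.
CLEAN = [
    {"id": 1, "name": "Maria Gonzalez", "dob": "1984-03-12", "address": "12 Oak St"},
    {"id": 2, "name": "John Smith", "dob": "1990-11-02", "address": "48 Pine Ave"},
    {"id": 3, "name": "Joan Smyth", "dob": "1975-06-30", "address": "7 Elm Rd"},
]
FIELDS = ("name", "dob", "address")


def add_typo(text, rng):
    """Swap two adjacent characters to mimic a transcription error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def similarity(a, b):
    """Average string similarity across the identifying fields."""
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS]
    return sum(scores) / len(scores)


def link(noisy_record, candidates):
    """Return the id of the candidate most similar to the noisy record."""
    best = max(candidates, key=lambda rec: similarity(noisy_record, rec))
    return best["id"]


rng = random.Random(0)  # fixed seed so the noise is reproducible
noisy = [{**rec, "name": add_typo(rec["name"], rng)} for rec in CLEAN]
links = [link(rec, CLEAN) for rec in noisy]
print(links)
```

In this sketch each noisy record still links back to its source because the date of birth and address are left intact. Realistic ER benchmarks are harder because noise can affect every field and the candidate set is far larger, which is the scale the simulated datasets are intended to reach.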