Constructing synthetic populations in the age of big data.

Population Health Metrics, 21(1):19. IF 2.5; Zone 2 (Medicine); JCR Q2, Public, Environmental & Occupational Health. Pub Date: 2023-10-31. DOI: 10.1186/s12963-023-00319-5

Mioara A Nicolaie, Koen Füssenich, Caroline Ameling, Hendriek C Boshuizen



Abstract

Background: Developing public health intervention models with micro-simulations requires extensive personal information about inhabitants, such as socio-demographic, economic and health data. Such data are confidential, yet the simulation inputs should reflect realistic scenarios. Data of this kind can be collected only in secured environments and are not directly available to open-source micro-simulation models. This paper illustrates a method for constructing synthetic data by predicting individual features with models fitted to confidential data on the health and socio-economic determinants of the entire Dutch population.

Methods: Administrative records and health registry data were linked to socio-economic characteristics and self-reported lifestyle factors. For the entire Dutch population (n = 16,778,708), all socio-demographic information except lifestyle factors was available. Lifestyle factors were available from the 2012 Dutch Health Monitor (n = 370,835). Regression models were used to sequentially predict individual features.

Results: The synthetic population resembles the original confidential population. Features predicted in the early stages of the sequential procedure closely match those in the original population, while features predicted in later stages accumulate limitations arising from data quality and from previously modelled features.

Conclusions: By combining socio-demographic, economic, health and lifestyle-related data at the individual level on a large scale, our method provides a powerful tool for constructing a good-quality synthetic population free of confidentiality issues.
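The sequential prediction described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the variables (age, sex, smoking, BMI), the sample sizes, the coefficients used to simulate a stand-in "survey", and the choice of a logistic model followed by a linear model are all hypothetical. It shows only the general idea: each feature is modelled on the features before it, and synthetic values are drawn from the fitted models in the same order.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients and residual standard deviation."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    rss = sum((yi - dot(r, beta)) ** 2 for r, yi in zip(X, y))
    return beta, math.sqrt(rss / (len(y) - p))

def fit_logistic(X, y, iters=25):
    """Logistic regression coefficients by Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        probs = [sigmoid(dot(r, beta)) for r in X]
        grad = [sum(r[i] * (yi - pi) for r, yi, pi in zip(X, y, probs))
                for i in range(p)]
        hess = [[sum(r[i] * r[j] * pi * (1 - pi) for r, pi in zip(X, probs))
                 for j in range(p)] for i in range(p)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)

# Hypothetical stand-in for the confidential survey sample.
# Columns: age (in decades), sex (0/1), smoking (0/1), BMI.
survey = []
for _ in range(2000):
    age = rng.uniform(2.0, 8.0)
    sex = rng.randint(0, 1)
    smoke = int(rng.random() < sigmoid(-0.5 - 0.2 * (age - 5.0) + 0.4 * sex))
    bmi = 22.0 + 0.5 * age + 1.0 * smoke + rng.gauss(0.0, 3.0)
    survey.append((age, sex, smoke, bmi))

# Stage 1: model smoking given age and sex.
beta_smoke = fit_logistic([[1.0, a, s] for a, s, _, _ in survey],
                          [row[2] for row in survey])
# Stage 2: model BMI given age, sex and the previously modelled smoking.
beta_bmi, sd_bmi = fit_ols([[1.0, a, s, k] for a, s, k, _ in survey],
                           [row[3] for row in survey])

# Generate synthetic individuals by drawing each feature in the same order.
synthetic = []
for _ in range(5000):
    age = rng.uniform(2.0, 8.0)      # assumed base covariates
    sex = rng.randint(0, 1)
    smoke = int(rng.random() < sigmoid(dot([1.0, age, sex], beta_smoke)))
    bmi = rng.gauss(dot([1.0, age, sex, smoke], beta_bmi), sd_bmi)
    synthetic.append((age, sex, smoke, bmi))

print("smoking prevalence:", mean(r[2] for r in survey), mean(r[2] for r in synthetic))
print("mean BMI:", mean(r[3] for r in survey), mean(r[3] for r in synthetic))
```

Because the BMI model takes the synthetic smoking status as input, any misfit in the first stage carries into the second, which is the kind of accumulation the Results describe for features predicted later in the sequence.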


Source journal

Population Health Metrics (Public, Environmental & Occupational Health)

CiteScore: 6.50
Self-citation rate: 0.00%
Review time: 29 weeks

About the journal: Population Health Metrics aims to advance the science of population health assessment, and welcomes papers relating to concepts, methods, ethics, applications, and summary measures of population health. The journal provides a unique platform for population health researchers to share their findings with the global community. We seek research that addresses the communication of population health measures and policy implications to stakeholders; this includes papers related to burden estimation and risk assessment, and research addressing population health across the full range of development. Population Health Metrics covers a broad range of topics encompassing health state measurement and valuation, summary measures of population health, descriptive epidemiology at the population level, burden of disease and injury analysis, disease and risk factor modeling for populations, and comparative assessment of risks to health at the population level. The journal is also interested in how to use and communicate indicators of population health to reduce disease burden, and the approaches for translating from indicators of population health to health-advancing actions. As a cross-cutting topic of importance, we are particularly interested in inequalities in population health and their measurement.