Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups

IF 3.9 · Zone 3, Medicine · Q1 HEALTH CARE SCIENCES & SERVICES · BMC Medical Research Methodology · Pub Date: 2024-09-09 · DOI: 10.1186/s12874-024-02327-x

Kiana Farhadyar, Federico Bonofiglio, Maren Hackenberg, Max Behrens, Daniela Zöller, Harald Binder



Abstract

In settings that require generating synthetic data from a clinical cohort, e.g., because of data protection regulations, heterogeneity across individuals might be a nuisance that we need to control, or a property that we need to preserve faithfully. The sources of such heterogeneity might be known, e.g., as indicated by sub-group labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that uses a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose combining VAEs with pre-transformations. To deal with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group-specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning when the goal is synthetic data that faithfully reflects structure from clinical cohorts.
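As a rough illustration of the group-membership idea in the abstract, the sketch below fits a logistic propensity score model for a binary sub-group label on two-dimensional coordinates. It is not the authors' implementation: the toy data stand in for VAE latent coordinates, and the function names (`fit_propensity`, `propensity`), the learning rate, and the number of epochs are all assumptions made for this example.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def propensity(w, x):
    # Estimated P(group = 1 | x); w[0] is the intercept.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_propensity(X, g, lr=0.1, epochs=500):
    # Logistic regression fitted by plain batch gradient descent
    # (hypothetical settings; a statistics package would normally be used).
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for x, y in zip(X, g):
            err = propensity(w, x) - y
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Toy stand-in for latent coordinates: two shifted Gaussian clouds,
# one per sub-group (assumed data, not from the paper).
rng = random.Random(0)
X, g = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    center = 1.5 if label == 1 else -1.5
    X.append([rng.gauss(center, 1.0), rng.gauss(0.0, 1.0)])
    g.append(label)

w = fit_propensity(X, g)
scores = [propensity(w, x) for x in X]

# Average estimated propensity within each true sub-group.
mean_g1 = sum(s for s, y in zip(scores, g) if y == 1) / g.count(1)
mean_g0 = sum(s for s, y in zip(scores, g) if y == 0) / g.count(0)
print(f"mean propensity, group 1: {mean_g1:.2f}; group 0: {mean_g0:.2f}")
```

Under these assumptions, scores near 0 or 1 mark points with strong sub-group-specific characteristics, while scores near 0.5 mark regions where the groups overlap. A score of this kind could be used to colour a latent-space plot or to decide which regions to sample from.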


Source journal: BMC Medical Research Methodology (Medicine, Health Care)
CiteScore: 6.50
Self-citation rate: 2.50%
Articles published: 298
Review time: 3-8 weeks

About the journal: BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.