Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups

IF 3.9 3区 医学 Q1 HEALTH CARE SCIENCES & SERVICES BMC Medical Research Methodology Pub Date : 2024-09-09 DOI:10.1186/s12874-024-02327-x
Kiana Farhadyar, Federico Bonofiglio, Maren Hackenberg, Max Behrens, Daniela Zöller, Harald Binder
{"title":"Combining propensity score methods with variational autoencoders for generating synthetic data in presence of latent sub-groups","authors":"Kiana Farhadyar, Federico Bonofiglio, Maren Hackenberg, Max Behrens, Daniela Zöller, Harald Binder","doi":"10.1186/s12874-024-02327-x","DOIUrl":null,"url":null,"abstract":"In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":null,"pages":null},"PeriodicalIF":3.9000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Research Methodology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12874-024-02327-x","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

In settings requiring synthetic data generation based on a clinical cohort, e.g., due to data protection regulations, heterogeneity across individuals might be a nuisance that we need to control or faithfully preserve. The sources of such heterogeneity might be known, e.g., as indicated by sub-groups labels, or might be unknown and thus reflected only in properties of distributions, such as bimodality or skewness. We investigate how such heterogeneity can be preserved and controlled when obtaining synthetic data from variational autoencoders (VAEs), i.e., a generative deep learning technique that utilizes a low-dimensional latent representation. To faithfully reproduce unknown heterogeneity reflected in marginal distributions, we propose to combine VAEs with pre-transformations. For dealing with known heterogeneity due to sub-groups, we complement VAEs with models for group membership, specifically from propensity score regression. The evaluation is performed with a realistic simulation design that features sub-groups and challenging marginal distributions. The proposed approach faithfully recovers the latter, compared to synthetic data approaches that focus purely on marginal distributions. Propensity scores add complementary information, e.g., when visualized in the latent space, and enable sampling of synthetic data with or without sub-group specific characteristics. We also illustrate the proposed approach with real data from an international stroke trial that exhibits considerable distribution differences between study sites, in addition to bimodality. These results indicate that describing heterogeneity by statistical approaches, such as propensity score regression, might be more generally useful for complementing generative deep learning for obtaining synthetic data that faithfully reflects structure from clinical cohorts.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
将倾向得分法与变异自动编码器相结合,在存在潜在子群的情况下生成合成数据
在需要根据临床队列生成合成数据的情况下(例如,由于数据保护规定),个体间的异质性可能是我们需要控制或忠实保留的麻烦。这种异质性的来源可能是已知的,如子组标签所示,也可能是未知的,因此只反映在分布的属性上,如双峰性或偏度。我们研究了在从变异自动编码器(VAE)(即一种利用低维潜在表示的生成式深度学习技术)获取合成数据时,如何保留和控制这种异质性。为了忠实再现边际分布中反映的未知异质性,我们建议将变异自动编码器与预变换相结合。为了处理因子群体而产生的已知异质性,我们利用群体成员模型(特别是倾向得分回归模型)对 VAE 进行了补充。评估是通过以分组和具有挑战性的边际分布为特征的现实模拟设计进行的。与纯粹关注边际分布的合成数据方法相比,所提出的方法能忠实地恢复后者。倾向分数增加了补充信息,例如在潜在空间中可视化时,并能对具有或不具有特定子群特征的合成数据进行采样。我们还利用一项国际中风试验的真实数据对所提出的方法进行了说明,除了双峰性之外,该试验还显示出不同研究地点之间存在相当大的分布差异。这些结果表明,通过倾向得分回归等统计方法来描述异质性,可能更有助于补充生成式深度学习,从而获得忠实反映临床队列结构的合成数据。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
BMC Medical Research Methodology
BMC Medical Research Methodology 医学-卫生保健
CiteScore
6.50
自引率
2.50%
发文量
298
审稿时长
3-8 weeks
期刊介绍: BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.
期刊最新文献
Challenges in measurement of adolescent mental health: how are gender patterns affected when level of symptoms is analysed simultaneously with impairment? Motivations for enrollment in a COVID-19 ring-based post-exposure prophylaxis trial: qualitative examination of participant experiences. Concordance between humans and GPT-4 in appraising the methodological quality of case reports and case series using the Murad tool. Bayesian additive regression trees for predicting childhood asthma in the CHILD cohort study. Incorporating external controls in the design of randomized clinical trials: a case study in solid tumors.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1