Actionability of Synthetic Data in a Heterogeneous and Rare Health Care Demographic: Adolescents and Young Adults With Cancer
Joshi Hogenboom, Aiara Lobo Gomes, Andre Dekker, Winette Van Der Graaf, Olga Husson, Leonard Wee
JCO Clinical Cancer Informatics, volume 8, e2400056. Journal article, published 2024-12-01 (Epub 2024-12-03).
DOI: 10.1200/CCI.24.00056 (https://doi.org/10.1200/CCI.24.00056)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11627331/pdf/
Abstract
Purpose: Research on rare diseases and atypical health care demographics is often slowed by high interparticipant heterogeneity and overall scarcity of data. Synthetic data (SD) have been proposed as a means of data sharing, enlargement, and diversification: they artificially reproduce real phenomena while obscuring the real patient data. The utility of SD is actively scrutinized in health care research, but the role of sample size in the actionability of SD remains insufficiently explored. We aim to understand the interplay of actionability and sample size by generating SD sets of varying sizes from gradually diminishing amounts of real individuals' data. We evaluate the actionability of SD in a highly heterogeneous and rare demographic: adolescents and young adults (AYAs) with cancer.
Methods: Data from a population-based cross-sectional cohort study of 3,735 AYAs were subsampled at random to produce 13 training data sets of varying sample sizes. We studied four distinct generator architectures built on the open-source Synthetic Data Vault library. Each architecture was used to generate SD of varying sizes from each of these training subsets. SD actionability was assessed by comparing the resulting SD with their respective real data on three metrics: veracity, utility, and privacy concealment.
Results: All examined generator architectures yielded actionable data when generating SD with sizes similar to the real data. Larger SD sample sizes increased veracity but generally also increased privacy risks. Using fewer training participants led to faster convergence in veracity but partially exacerbated privacy concealment issues.
Conclusion: SD are a potentially promising option for data sharing and data augmentation, yet sample size plays a significant role in their actionability. SD generation should go hand in hand with consistent scrutiny, and sample size should be carefully considered in this process.
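The experimental design described in the Methods (random subsampling into 13 training sets, then SD of varying sizes from each of four generators) can be illustrated with a minimal sketch. It only enumerates the runs and does not call the Synthetic Data Vault library. The abstract does not report the training sizes, the generator names, or the SD sizes, so `TRAINING_SIZES`, `GENERATORS`, and `SD_SIZE_MULTIPLIERS` below are hypothetical placeholders, as is the choice to draw each subset independently.

```python
import random
from itertools import product

N_COHORT = 3735  # cohort size reported in the abstract

# Hypothetical values: the abstract reports 13 training sets and four
# generator architectures, but not the sizes, names, or SD sizes used.
TRAINING_SIZES = [3735, 3000, 2500, 2000, 1500, 1000, 750, 500, 400, 300, 200, 100, 50]
GENERATORS = ["generator_a", "generator_b", "generator_c", "generator_d"]
SD_SIZE_MULTIPLIERS = [0.5, 1.0, 2.0, 5.0]


def draw_training_subsets(n_cohort, sizes, seed=0):
    """Randomly draw participant indices, without replacement, for each size."""
    rng = random.Random(seed)
    # Assumption: each subset is drawn independently from the full cohort.
    return {size: rng.sample(range(n_cohort), size) for size in sizes}


def build_experiment_grid(sizes, generators, multipliers):
    """List every (generator, training size, synthetic size) combination."""
    return [
        {"generator": g, "n_train": n, "n_synthetic": max(1, round(n * m))}
        for g, n, m in product(generators, sizes, multipliers)
    ]


subsets = draw_training_subsets(N_COHORT, TRAINING_SIZES)
grid = build_experiment_grid(TRAINING_SIZES, GENERATORS, SD_SIZE_MULTIPLIERS)
print(len(subsets), "training sets,", len(grid), "generation runs")
```

In a real pipeline, each entry of `grid` would be passed to a fitted generator and the resulting SD scored against the matching real subset on veracity, utility, and privacy concealment.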