Actionability of Synthetic Data in a Heterogeneous and Rare Health Care Demographic: Adolescents and Young Adults With Cancer.

IF 3.3 Q2 ONCOLOGY JCO Clinical Cancer Informatics Pub Date : 2024-12-01 Epub Date: 2024-12-03 DOI:10.1200/CCI.24.00056
Joshi Hogenboom, Aiara Lobo Gomes, Andre Dekker, Winette Van Der Graaf, Olga Husson, Leonard Wee
Journal: JCO Clinical Cancer Informatics, volume 8, article e2400056. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11627331/pdf/
Citations: 0

Abstract

Purpose: Research on rare diseases and atypical health care demographics is often slowed by high interparticipant heterogeneity and overall scarcity of data. Synthetic data (SD) have been proposed as a means of data sharing, enlargement, and diversification: they artificially reproduce real phenomena while obscuring the real patient data. The utility of SD is actively scrutinized in health care research, but the role of sample size in the actionability of SD is insufficiently explored. We aim to understand the interplay of actionability and sample size by generating SD sets of varying sizes from gradually diminishing amounts of real individuals' data. We evaluate the actionability of SD in a highly heterogeneous and rare demographic: adolescents and young adults (AYAs) with cancer.

Methods: A population-based cross-sectional cohort study of 3,735 AYAs was subsampled at random to produce 13 training data sets of varying sample sizes. We studied four distinct generator architectures built on the open-source Synthetic Data Vault library. Each architecture was used to generate SD of varying sizes from each of the aforementioned training subsets. SD actionability was assessed by comparing the resulting SD with their respective real data on three metrics: veracity, utility, and privacy concealment.
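The design described in the Methods can be sketched as a grid of training-subset sizes, generators, and synthetic sample sizes. The sketch below is illustrative only and uses the Python standard library rather than the Synthetic Data Vault. The cohort size (3,735) and the number of training sets (13) come from the abstract; the evenly spaced subset sizes, the nested sampling scheme, the generator labels, and the synthetic-size factors are all assumptions, since the abstract does not state them.

```python
import itertools
import random

N_REAL = 3735      # cohort size stated in the abstract
N_SUBSETS = 13     # number of training data sets stated in the abstract

# ASSUMPTION: evenly spaced subset sizes; the abstract does not list the actual sizes.
TRAIN_SIZES = [round(N_REAL * k / N_SUBSETS) for k in range(N_SUBSETS, 0, -1)]

# ASSUMPTION: placeholder labels; the abstract does not name the four architectures.
GENERATORS = ["generator_1", "generator_2", "generator_3", "generator_4"]

# ASSUMPTION: synthetic set sizes expressed as multiples of the training size.
SD_SIZE_FACTORS = [0.5, 1.0, 2.0]


def draw_training_subsets(record_ids, sizes, seed=0):
    """Return {size: subset} where each subset is a simple random sample.

    All subsets are prefixes of one shuffled list, so smaller subsets are
    nested inside larger ones (an assumption, not a detail from the source).
    """
    rng = random.Random(seed)
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}


def experiment_grid(sizes, generators, factors):
    """Enumerate every (training size, generator, synthetic size) combination."""
    return [
        {"n_train": n, "generator": g, "n_synthetic": max(1, round(n * f))}
        for n, g, f in itertools.product(sizes, generators, factors)
    ]


subsets = draw_training_subsets(range(N_REAL), TRAIN_SIZES)
grid = experiment_grid(TRAIN_SIZES, GENERATORS, SD_SIZE_FACTORS)
print(len(subsets), "training subsets;", len(grid), "generation runs")
```

In a real pipeline, each entry of `grid` would be handed to a generator fitted on the matching training subset, and the resulting synthetic set would then be scored on veracity, utility, and privacy concealment.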

Results: All examined generator architectures yielded actionable data when generating SD with sizes similar to the real data. Larger SD sample sizes increased veracity but generally also increased privacy risks. Using fewer training participants led to faster convergence in veracity but partially exacerbated privacy concealment issues.
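One intuition for why larger synthetic samples can raise privacy risk is that every additional synthetic record is another chance to reproduce a real record verbatim. The toy sketch below illustrates only that intuition. It is not the study's privacy metric or one of its generators: the stand-in generator (independent sampling from each column's training values), the exact-match measure, and the invented categorical records are all assumptions made for illustration.

```python
import random


def toy_generator(train, n, rng):
    """Stand-in generator (ASSUMPTION): sample each column independently
    from the values observed in the training data."""
    columns = list(zip(*train))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def reproduced_fraction(train, synthetic):
    """Fraction of distinct training records that appear verbatim in the
    synthetic data: a crude proxy for exposure, not a formal privacy metric."""
    real = set(train)
    return len(real & set(synthetic)) / len(real)


rng = random.Random(0)

# Hypothetical records with three categorical fields; the values are invented.
train = [(rng.randrange(5), rng.randrange(2), rng.randrange(8)) for _ in range(50)]

# Draw one large synthetic pool and inspect growing prefixes of it, so the
# smaller synthetic sets are contained in the larger ones.
pool = toy_generator(train, 2000, rng)
sd_sizes = (50, 200, 1000, 2000)
fractions = [reproduced_fraction(train, pool[:n]) for n in sd_sizes]

for n, frac in zip(sd_sizes, fractions):
    print(f"synthetic n={n}: {frac:.2f} of real records reproduced")
```

Because each smaller synthetic set is a prefix of the larger one, the reproduced fraction can only stay level or rise as the synthetic sample grows. Real generators and privacy measures behave less simply, which is why the abstract calls for consistent scrutiny at each sample size.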

Conclusion: SD is a potentially promising option for data sharing and data augmentation, yet sample size plays a significant role in its actionability. SD generation should go hand-in-hand with consistent scrutiny, and sample size should be carefully considered in this process.

Source journal metrics
CiteScore: 6.20
Self-citation rate: 4.80%
Articles published: 190