AI-driven synthetic data generation for accelerating hepatology research: A study of the United Network for Organ Sharing (UNOS) database
Joseph C. Ahn, Yung-Kyun Noh, Mingzhao Hu, Xiaotong Shen, Douglas A. Simonetto, Patrick S. Kamath, Rohit Loomba, Vijay H. Shah
Hepatology (Journal Article), published 2025-03-11. DOI: 10.1097/hep.0000000000001299 (https://doi.org/10.1097/hep.0000000000001299). Impact factor: 15.8; JCR Q1, Gastroenterology & Hepatology.
Abstract
Background and Aims: Clinical hepatology research often faces limited data availability, underrepresentation of minority groups, and complex data-sharing regulations. Synthetic data, artificially generated patient records designed to mirror real-world distributions, offers a potential solution. We hypothesized that diffusion models, a state-of-the-art generative technique, could produce synthetic liver transplant waitlist data from the United Network for Organ Sharing (UNOS) database that maintains statistical fidelity, replicates clinical correlations and survival patterns, and ensures robust privacy protection.
Methods: Diffusion models were used to generate synthetic patient cohorts mirroring the UNOS liver transplant waitlist database from 2019 to 2023. Statistical fidelity was assessed using Maximum Mean Discrepancy (MMD), Wasserstein distance, correlation analysis, and variable-level metrics. Clinical utility was evaluated by comparing transplant-free survival via Kaplan-Meier curves and by comparing MELD score performance. Privacy was quantified using the Distance to Closest Record (DCR) and attribute disclosure risk assessments.
Results: The synthetic dataset was nearly indistinguishable from the original dataset (MMD=0.002, standardized Wasserstein distance<1.0), preserving clinically relevant correlations and survival patterns, as evidenced by similar median survival times (110 vs. 101 days) and 5-year survival rates (22.2% vs. 22.8%). MELD-based 90-day mortality prediction was maintained (original AUC=0.839 vs. synthetic AUC=0.844). Privacy metrics indicated no identifiable patient matches, and mean DCR values indicated that synthetic individuals were not direct replicas of real patients.
Conclusion: AI-generated synthetic data derived from diffusion models can faithfully replicate complex hepatology datasets, maintain key clinical signals, and ensure strong privacy safeguards. This approach can help address data scarcity, enhance model generalizability, foster multi-institutional collaboration, and accelerate progress in hepatology research.
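Two of the metrics named in the Methods, MMD and DCR, can be illustrated with a minimal sketch. The abstract does not state the kernel, bandwidth, distance metric, or preprocessing the authors used, so everything below is an assumption chosen for illustration: a Gaussian (RBF) kernel with a fixed bandwidth, the biased MMD estimator, Euclidean distance for DCR, and records represented as tuples of already-scaled numeric features. The function names are hypothetical and are not taken from the study.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(real, synthetic, bandwidth=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy.

    Values near 0 mean the two samples look alike under the chosen
    kernel; larger values mean the distributions differ.
    The kernel and bandwidth are illustrative assumptions.
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(x, y, bandwidth) for x in a for y in b)
        return total / (len(a) * len(b))

    return (mean_kernel(real, real)
            + mean_kernel(synthetic, synthetic)
            - 2.0 * mean_kernel(real, synthetic))


def distance_to_closest_record(real, synthetic):
    """For each synthetic row, the Euclidean distance to its nearest real row.

    A distance of 0 means the synthetic row is an exact copy of a real
    record; larger distances suggest the row is not a direct replica.
    Euclidean distance is an assumption made for this sketch.
    """
    return [min(math.dist(s, r) for r in real) for s in synthetic]
```

In use, both functions take lists of equal-length numeric tuples, for example `mmd_squared(real_rows, synthetic_rows)` and `distance_to_closest_record(real_rows, synthetic_rows)`. The pairwise loops are quadratic in the number of rows, so a cohort-sized comparison would need a vectorized or subsampled variant.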
About the journal:
HEPATOLOGY is recognized as the leading publication in the field of liver disease. It features original, peer-reviewed articles covering various aspects of liver structure, function, and disease. The journal's distinguished Editorial Board carefully selects the best articles each month, focusing on topics including immunology, chronic hepatitis, viral hepatitis, cirrhosis, genetic and metabolic liver diseases, liver cancer, and drug metabolism.