Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.

Frontiers in Artificial Intelligence (IF 4.7; JCR Q2, Computer Science, Artificial Intelligence). Pub Date: 2025-02-05. eCollection Date: 2025-01-01. DOI: 10.3389/frai.2025.1533508
Austin A Barr, Joshua Quan, Eddie Guo, Emre Sezgin
Journal: Frontiers in Artificial Intelligence, vol. 8, article 1533508. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11836953/pdf/
Citations: 0

Abstract

Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.
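The three fidelity checks named in the Methods can be sketched from summary statistics alone. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the t-test is written in the unequal-variance (Welch) form, which the abstract does not specify, and the p-value uses a normal approximation that is only reasonable for samples in the thousands.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z95 = STD_NORMAL.inv_cdf(0.975)  # critical value for a 95% interval, about 1.96


def welch_t_from_stats(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic (Welch form, an assumption) and two-sided p-value.

    The p-value uses a normal approximation to the t distribution,
    which is only appropriate when both samples are large.
    """
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    t = (mean1 - mean2) / se
    p = 2 * (1 - STD_NORMAL.cdf(abs(t)))
    return t, p


def two_proportion_z(x1, n1, x2, n2):
    """Two-sample proportion z-test with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p


def mean_ci95(mean, sd, n):
    """Normal-approximation 95% confidence interval for a mean."""
    half_width = Z95 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are not values from the study.
    t, p = welch_t_from_stats(57.3, 14.9, 6000, 57.1, 15.2, 6000)
    print(f"t = {t:.3f}, p = {p:.3f}")
    ci_real = mean_ci95(57.3, 14.9, 6000)
    ci_synth = mean_ci95(57.1, 15.2, 6000)
    print("95% CIs overlap:", intervals_overlap(ci_real, ci_synth))
```

Under this reading, a parameter would count as statistically similar when the test p-value is at or above the chosen significance level (conventionally 0.05, which the abstract does not state), with CI overlap serving as a second, looser check for continuous parameters.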

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. Values fell within plausible ranges, and body mass index was correctly calculated for all case files from the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.
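The cross-verification of body mass index against height and weight can be illustrated with a short check. This is a sketch under stated assumptions rather than the study's procedure: the field names `height_cm`, `weight_kg`, and `bmi`, the units, and the 0.1 tolerance are all hypothetical choices made for the example.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def bmi_mismatches(records, tolerance=0.1):
    """Return indices of records whose stored BMI differs from the recomputed value.

    Each record is a dict with hypothetical keys 'height_cm', 'weight_kg', 'bmi'.
    The tolerance absorbs rounding in the stored value.
    """
    mismatched = []
    for index, record in enumerate(records):
        expected = bmi(record["weight_kg"], record["height_cm"])
        if abs(expected - record["bmi"]) > tolerance:
            mismatched.append(index)
    return mismatched


if __name__ == "__main__":
    # Made-up rows for illustration only; they are not taken from either dataset.
    rows = [
        {"height_cm": 175.0, "weight_kg": 70.0, "bmi": 22.9},
        {"height_cm": 160.0, "weight_kg": 64.0, "bmi": 25.0},
        {"height_cm": 180.0, "weight_kg": 81.0, "bmi": 30.0},  # inconsistent on purpose
    ]
    print("Rows with inconsistent BMI:", bmi_mismatches(rows))
```

An empty result over every generated row would correspond to the abstract's statement that BMI was correctly calculated for all case files.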

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.

Source journal
CiteScore: 6.10
Self-citation rate: 2.50%
Articles published: 272
Review time: 13 weeks