CTGAN-driven synthetic data generation: A multidisciplinary, expert-guided approach (TIMA)

IF 4.9 | Zone 2 (Medicine) | Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Computer methods and programs in biomedicine | Pub Date: 2024-11-19 | DOI: 10.1016/j.cmpb.2024.108523
Orlando Parise, Rani Kronenberger, Gianmarco Parise, Carlo de Asmundis, Sandro Gelsomino, Mark La Meir
Computer methods and programs in biomedicine, volume 259, Article 108523. https://www.sciencedirect.com/science/article/pii/S0169260724005169

Abstract

Objective

We generated synthetic data from a population of 238 adult SARS-CoV-2-positive patients admitted to the University Hospital of Brussels, Belgium, in 2020, using a Conditional Tabular Generative Adversarial Network (CTGAN)-based technique. The aim was to test the performance, representativeness, realism, novelty, and diversity of synthetic data generated from a small patient sample. The Multidisciplinary Approach (TIMA) incorporates active participation from a medical team throughout the stages of this process.

Methods

The TIMA committee scrutinized the data for inconsistencies and implemented strict rules for variables that the system did not learn. A sensitivity analysis set the number of training epochs at 100,000, and the trained model was then used to generate 10,000 synthetic records. The model's performance was tested using a general-purpose dataset, comparing real and synthetic data.
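The abstract does not list the committee's rules or say how they were enforced. As a minimal sketch of one plausible mechanism, expert rules can be written as named predicates and used to reject synthetic records that violate any of them. The rule names, the field names (`age`, `icu_days`, `hospital_days`), and the sample records below are hypothetical and not taken from the paper; only the adult-only constraint reflects the study population described above.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Rule = Callable[[Record], bool]

# Hypothetical consistency rules; the actual TIMA rules are not given in the abstract.
RULES: Dict[str, Rule] = {
    # The study population consists of adults only.
    "adult_only": lambda r: r["age"] >= 18,
    # Assumed rule: ICU days cannot exceed the total hospital stay.
    "icu_days_within_stay": lambda r: 0 <= r["icu_days"] <= r["hospital_days"],
}


def violated_rules(record: Record, rules: Dict[str, Rule]) -> List[str]:
    """Return the names of all rules that the record fails."""
    return [name for name, rule in rules.items() if not rule(record)]


def filter_records(records: List[Record], rules: Dict[str, Rule]) -> List[Record]:
    """Keep only the records that satisfy every rule."""
    return [r for r in records if not violated_rules(r, rules)]


# Made-up synthetic records, for illustration only.
synthetic = [
    {"age": 64, "icu_days": 3, "hospital_days": 12},  # passes both rules
    {"age": 15, "icu_days": 0, "hospital_days": 4},   # fails adult_only
    {"age": 71, "icu_days": 9, "hospital_days": 5},   # fails icu_days_within_stay
]
kept = filter_records(synthetic, RULES)
```

Keeping the rules as named predicates lets a clinical reviewer see which rule rejected a given record, which suits a workflow where a medical team reviews the output at each stage.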

Results

The outcomes indicate the robustness of our model, with an average contingency score of 0.94 across variable pairs in synthetic and real data. Continuous variables showed a median correlation similarity score of 0.97. Novelty received the top score of 1. Principal Component Analysis (PCA) on synthetic values demonstrated diversity, as no patient pair displayed a zero or close-to-zero distance. Notably, in the TIMA committee's evaluation, synthetic data was recognized as authentic at a rate of nearly 100%.
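The abstract does not define the contingency or correlation similarity scores. The sketch below uses one common formulation, which is an assumption here rather than the authors' confirmed method: contingency similarity as 1 minus the total variation distance between the joint category frequencies of a variable pair, and correlation similarity as 1 minus half the absolute difference between the Pearson correlations in the real and synthetic data. Both scores lie in [0, 1], with 1 meaning identical behaviour. All data values are made up for illustration.

```python
from collections import Counter
from math import sqrt


def contingency_similarity(real_pairs, synth_pairs):
    """1 minus the total variation distance between two joint category distributions."""
    real_freq = Counter(real_pairs)
    synth_freq = Counter(synth_pairs)
    n_real, n_synth = len(real_pairs), len(synth_pairs)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for category pairs that are absent from one dataset.
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - tvd


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """1 minus half the absolute gap between the real and synthetic correlations."""
    return 1.0 - abs(pearson(real_x, real_y) - pearson(synth_x, synth_y)) / 2.0


# Made-up categorical variable pair (values are placeholders).
real_pairs = [("A", "yes"), ("A", "no"), ("B", "yes"), ("B", "no")]
synth_pairs = [("A", "yes"), ("A", "yes"), ("B", "no"), ("B", "no")]
cont_score = contingency_similarity(real_pairs, synth_pairs)

# Made-up continuous variable pair.
real_x, real_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
synth_x, synth_y = [1.5, 2.5, 3.5, 4.5], [3.0, 5.1, 6.9, 9.2]
corr_score = correlation_similarity(real_x, real_y, synth_x, synth_y)
```

Under this formulation, a dataset-level figure such as the reported average of 0.94 would come from averaging the per-pair scores over all variable pairs, and the reported 0.97 from taking the median over the continuous pairs.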

Conclusions

Our trained model performed well, yielding a synthetic dataset that was highly representative of the original. The synthetic dataset proved realistic, with high levels of novelty and diversity.
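The novelty and diversity checks reported in the Results can be sketched as follows, under stated assumptions. Novelty is taken here to be the fraction of synthetic rows that do not exactly duplicate any real row, so a score of 1 means no real patient was copied. Diversity is checked by confirming that the smallest pairwise distance between synthetic rows is well above zero. The paper computes distances after a PCA projection; this sketch skips that step and measures Euclidean distance in the raw feature space, which is a simplification. All rows are made-up numeric values.

```python
from itertools import combinations
from math import dist


def novelty_score(real_rows, synth_rows):
    """Fraction of synthetic rows that do not exactly duplicate any real row."""
    real_set = {tuple(row) for row in real_rows}
    novel = sum(tuple(row) not in real_set for row in synth_rows)
    return novel / len(synth_rows)


def min_pairwise_distance(rows):
    """Smallest Euclidean distance between any two rows (0 means a duplicate pair)."""
    return min(dist(a, b) for a, b in combinations(rows, 2))


# Made-up rows with two numeric features each.
real_rows = [(64.0, 1.2), (50.0, 0.9)]
synth_rows = [(63.0, 1.1), (52.0, 1.0), (70.0, 1.4)]

novelty = novelty_score(real_rows, synth_rows)
closest = min_pairwise_distance(synth_rows)
```

In practice the features would be standardized before computing distances, so that variables measured on large scales do not dominate the result.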

Source journal
Computer methods and programs in biomedicine (Engineering and Technology: Biomedical Engineering)
CiteScore: 12.30
Self-citation rate: 6.60%
Articles published: 601
Review time: 135 days
Journal description: To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine. Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.