Orlando Parise, Rani Kronenberger, Gianmarco Parise, Carlo de Asmundis, Sandro Gelsomino, Mark La Meir
Computer Methods and Programs in Biomedicine, volume 259, Article 108523. Published 2024-11-19. DOI: 10.1016/j.cmpb.2024.108523. https://www.sciencedirect.com/science/article/pii/S0169260724005169
CTGAN-driven synthetic data generation: A multidisciplinary, expert-guided approach (TIMA)
Objective
We generated synthetic data from a population of 238 adult SARS-CoV-2-positive patients admitted to the University Hospital of Brussels, Belgium, in 2020, using a technique based on a Conditional Tabular Generative Adversarial Network (CTGAN). The aim was to test the performance, representativeness, realism, novelty, and diversity of synthetic data generated from a small patient sample. The multidisciplinary approach (TIMA) incorporates active participation from a medical team throughout the stages of this process.
Methods
The TIMA committee scrutinized the data for inconsistencies and imposed strict rules on variables that the system failed to learn. A sensitivity analysis set the number of training epochs at 100,000, and the trained model was then used to generate 10,000 synthetic records. The model's performance was tested using a general-purpose dataset, comparing real and synthetic data.
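The rule-based screening described above can be illustrated with a minimal sketch. The abstract does not list the committee's rules, so the variable names (`age`, `icu_days`, `hospital_days`, `spo2`) and the thresholds below are hypothetical assumptions chosen only to show the pattern: each rule is a predicate, and a synthetic record is kept only if it satisfies all of them.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Rule = Callable[[Record], bool]

# Hypothetical plausibility rules. The variable names and thresholds are
# illustrative assumptions, not the rules used by the TIMA committee.
RULES: Dict[str, Rule] = {
    "adult_age": lambda r: 18 <= r["age"] <= 110,
    "icu_within_stay": lambda r: 0 <= r["icu_days"] <= r["hospital_days"],
    "spo2_is_percentage": lambda r: 0 <= r["spo2"] <= 100,
}


def violated_rules(record: Record, rules: Dict[str, Rule] = RULES) -> List[str]:
    """Return the names of all rules that the record breaks."""
    return [name for name, rule in rules.items() if not rule(record)]


def filter_records(records: List[Record]) -> List[Record]:
    """Keep only the records that satisfy every rule."""
    return [r for r in records if not violated_rules(r)]


# Made-up synthetic records for demonstration only.
synthetic = [
    {"age": 64, "icu_days": 3, "hospital_days": 12, "spo2": 93},
    {"age": 47, "icu_days": 9, "hospital_days": 5, "spo2": 96},  # ICU stay exceeds admission
    {"age": 12, "icu_days": 0, "hospital_days": 4, "spo2": 98},  # not an adult
]
kept = filter_records(synthetic)
print(len(kept), "of", len(synthetic), "records kept")
```

Keeping the rules in a named dictionary makes it easy to report which rule rejected a record, which is useful when clinicians review the rejections.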
Results
The results indicate a robust model: the average contingency score across variable pairs, comparing synthetic with real data, was 0.94. Continuous variables had a median correlation similarity score of 0.97. Novelty reached the maximum score of 1. Principal component analysis (PCA) of the synthetic values demonstrated diversity: no pair of synthetic patients lay at zero or near-zero distance from each other. In the TIMA committee's evaluation, the synthetic data were recognized as authentic at a rate of nearly 100%.
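The abstract does not define how these scores were computed. As a rough sketch, the functions below assume common definitions: contingency similarity as one minus the total variation distance between the joint category frequencies, correlation similarity as one minus half the absolute difference in Pearson correlation (both as documented for the SDMetrics library), novelty as the share of synthetic rows absent from the real data, and diversity as the smallest pairwise Euclidean distance. The distance is taken on raw vectors here; the PCA projection used in the paper is omitted. All data are made up.

```python
import math
from collections import Counter
from itertools import combinations


def contingency_similarity(real_pairs, synth_pairs):
    """1 minus the total variation distance between two joint category distributions."""
    real_freq, synth_freq = Counter(real_pairs), Counter(synth_pairs)
    n_real, n_synth = len(real_pairs), len(synth_pairs)
    cells = set(real_freq) | set(synth_freq)
    tvd = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in cells
    )
    return 1.0 - tvd


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_similarity(real_x, real_y, synth_x, synth_y):
    """1 minus half the absolute difference between real and synthetic correlations."""
    return 1.0 - abs(pearson(real_x, real_y) - pearson(synth_x, synth_y)) / 2.0


def novelty(real_rows, synth_rows):
    """Fraction of synthetic rows that do not appear verbatim in the real data."""
    real_set = set(real_rows)
    return sum(row not in real_set for row in synth_rows) / len(synth_rows)


def min_pairwise_distance(rows):
    """Smallest Euclidean distance between any two rows; 0 means a duplicate pair."""
    return min(math.dist(a, b) for a, b in combinations(rows, 2))


# Toy inputs for demonstration only.
real_pairs = [("M", "yes"), ("M", "no"), ("F", "yes"), ("F", "no")]
synth_pairs = [("M", "yes"), ("M", "yes"), ("F", "yes"), ("F", "no")]
print("contingency similarity:", contingency_similarity(real_pairs, synth_pairs))
print("correlation similarity:",
      correlation_similarity([1, 2, 3, 4], [2, 4, 6, 8], [1, 2, 3, 4], [8, 6, 4, 2]))
print("novelty:", novelty([(1, 2), (3, 4)], [(1, 2), (5, 6), (7, 8), (9, 9)]))
print("minimum pairwise distance:", min_pairwise_distance([(0, 0), (3, 4), (6, 8)]))
```

Under these assumed definitions, a novelty of 1 means no synthetic record copies a real patient, and a minimum pairwise distance well above zero means no two synthetic patients are near-duplicates.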
Conclusions
Our trained model performed well, producing a synthetic dataset that was highly representative of the original. The synthetic dataset proved realistic, with high levels of novelty and diversity.
About the journal:
To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine.
Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.