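A Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study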

IF 3.1 | Zone 3, Medicine | JCR Q2, Medical Informatics | JMIR Medical Informatics | Pub Date: 2024-05-08 | DOI: 10.2196/55118
Ippei Akiya, Takuma Ishihara, Keiichi Yamamoto
Citations: 0

Abstract

A Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study.

Background: Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and CTGAN, have been employed for this purpose, but their performance in reflecting actual patient survival data remains under investigation.

Objective: The aim of this study was to determine the most suitable SPD generation method for oncology trials, focusing on progression-free survival (PFS) and overall survival (OS), the primary evaluation endpoints in oncology trials. To this end, we ran a comparative simulation of 4 generation methods (CART, RF, BN, and CTGAN) and evaluated the performance of each.

Methods: For each of several clinical trial datasets, we generated 1000 synthetic datasets with each method and evaluated them on 3 criteria: (1) median survival time (MST) of PFS and OS; (2) hazard ratio distance (HRD), which indicates the similarity between the actual survival function and a synthetic survival function; and (3) visual analysis of Kaplan-Meier (KM) plots. Together, these criteria assess each method's ability to mimic the statistical properties of real patient data.
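To make the first criterion concrete, the following is a minimal pure-Python sketch of the Kaplan-Meier (product-limit) estimator and of reading the median survival time off the resulting curve. It is an illustration only, not the study's implementation: the function names `kaplan_meier` and `median_survival_time` and the toy data are hypothetical, and the median is taken as the first event time at which the estimated survival probability is 0.5 or lower.

```python
from typing import List, Optional, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct event time (product-limit estimator).

    times  -- follow-up time for each patient
    events -- 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same time; censored patients at t
        # still count as at risk at t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def median_survival_time(times: List[float], events: List[int]) -> Optional[float]:
    """Smallest event time with S(t) <= 0.5, or None if the median is not reached."""
    for t, s in kaplan_meier(times, events):
        if s <= 0.5:
            return t
    return None


# Toy data (hypothetical): months to event; 1 = event observed, 0 = censored.
toy_times = [2, 3, 3, 5, 8, 9, 12, 14]
toy_events = [1, 1, 0, 1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
toy_mst = median_survival_time(toy_times, toy_events)
```

On this toy sample the estimated survival probability first falls to 0.5 or below at month 8, so `toy_mst` is 8. In practice a maintained survival analysis library would normally be used instead of a hand-written estimator.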

Results: In most simulation cases, CART yielded high percentages of synthetic-data MSTs falling within the 95% confidence interval (CI) of the MST of the actual data: 88.8% to 98.0% for PFS and 60.8% to 96.1% for OS. For CART, HRD values were concentrated at approximately 0.9. For the other methods, by contrast, no consistent trend was observed for either PFS or OS. CART showed better similarity than RF because CART overfitted the data, whereas RF, an ensemble learning method, prevented overfitting. In SPD generation, the goal is to reproduce statistical properties close to those of the actual data, not to build a well-generalized prediction model. Neither BN nor CTGAN accurately reflected the statistical properties of the actual data, because these methods are not well suited to small datasets.
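The percentages reported in the results can be read as a simple coverage calculation: of the synthetic datasets generated by one method, what share have an MST inside the 95% CI of the MST of the actual data? The sketch below illustrates that calculation under stated assumptions. The function name `coverage_percentage`, the CI bounds, and the synthetic MST values are all hypothetical, and counting a median that was not reached (`None`) as outside the interval is a choice made for this sketch rather than something stated in the abstract.

```python
from typing import List, Optional


def coverage_percentage(
    synthetic_msts: List[Optional[float]], ci_lower: float, ci_upper: float
) -> float:
    """Percentage of synthetic MSTs lying within [ci_lower, ci_upper].

    A value of None (median not reached) is counted as outside the interval;
    this is an assumption of the sketch.
    """
    if not synthetic_msts:
        raise ValueError("synthetic_msts must not be empty")
    inside = sum(
        1 for m in synthetic_msts if m is not None and ci_lower <= m <= ci_upper
    )
    return 100.0 * inside / len(synthetic_msts)


# Hypothetical example: 10 synthetic MSTs (months) against an assumed
# 95% CI of 7.0 to 9.5 months for the actual data.
example_msts = [7.5, 8.0, 8.2, 9.1, 6.0, 8.8, 10.4, 7.9, 8.4, 8.1]
example_coverage = coverage_percentage(example_msts, 7.0, 9.5)
```

Here 8 of the 10 hypothetical values fall inside the interval, giving 80.0%. The study applied this kind of summary to 1000 synthetic datasets per method and per clinical trial dataset.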

Conclusions: For generating SPD for survival data from small datasets, such as clinical trial data, CART proved more effective than RF, BN, and CTGAN. Future work may further improve CART-based generation by incorporating feature engineering and other methods.

Source journal
JMIR Medical Informatics (Medicine - Health Informatics)
CiteScore: 7.90
Self-citation rate: 3.10%
Articles published: 173
Review time: 12 weeks
Journal description: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, eHealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals than on consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.