MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation.

Proceedings of the ... SIAM International Conference on Data Mining. Pub Date: 2024-01-01. DOI: 10.1137/1.9781611978032.58

Yuan Zhong, Suhan Cui, Jiaqi Wang, Xiaochen Wang, Ziyi Yin, Yaqing Wang, Houping Xiao, Mengdi Huai, Ting Wang, Fenglong Ma

Volume 2024, pages 499-507. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11469648/pdf/


Abstract

Health risk prediction aims to forecast the potential health risks that patients may face using their historical Electronic Health Records (EHR). Although several effective models have been developed, data insufficiency is a key issue undermining their effectiveness. Various data generation and augmentation methods have been introduced to mitigate this issue by learning the underlying data distribution and using it to expand the training set. However, the performance of these methods is often limited because they are designed independently of the downstream prediction task. To address these shortcomings, this paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge the sample space. Furthermore, MedDiffusion discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data. Experimental evaluation on four real-world medical datasets demonstrates that MedDiffusion outperforms 14 cutting-edge baselines in terms of PR-AUC, F1, and Cohen's Kappa. We also conduct ablation studies and benchmark our model against GAN-based alternatives to further validate the rationality and adaptability of our model design. Additionally, we analyze the generated data to offer fresh insights into the model's interpretability. The source code is available via https://shorturl.at/aerT0.
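The three evaluation metrics named in the abstract can be illustrated with a minimal, standard-library Python sketch. This is not the authors' evaluation code: the function names, the binary 0/1 label encoding, the 0.5 decision threshold, and the use of average precision as the PR-AUC estimate (with no tied scores) are all assumptions made here for illustration.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    labels = set(y_true) | set(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expected = sum(
        (list(y_true).count(c) / n) * (list(y_pred).count(c) / n) for c in labels
    )
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def pr_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a common estimate of the area under the PR curve.

    Assumes scores have no ties; tied scores would need grouped handling.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    # Rank patients from highest to lowest predicted risk.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


if __name__ == "__main__":
    # Hypothetical toy data: 1 = patient develops the condition, 0 = does not.
    labels = [1, 0, 1, 1, 0, 0]
    risk_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    predictions = [1 if s >= 0.5 else 0 for s in risk_scores]  # assumed threshold
    print(pr_auc(labels, risk_scores))
    print(f1_score(labels, predictions))
    print(cohen_kappa(labels, predictions))
```

PR-AUC is computed from the ranked risk scores and needs no threshold, whereas F1 and Cohen's Kappa are computed from thresholded predictions.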


