Multi-modality risk prediction of cardiovascular diseases for breast cancer cohort in the All of Us Research Program.

Journal of the American Medical Informatics Association (IF 4.7; Zone 2, Medicine; JCR Q1, Computer Science, Information Systems). Pub Date: 2024-07-26. DOI: 10.1093/jamia/ocae199
Han Yang, Sicheng Zhou, Zexi Rao, Chen Zhao, Erjia Cui, Chetan Shenoy, Anne H Blaes, Nishitha Paidimukkala, Jinhua Wang, Jue Hou, Rui Zhang

Abstract

Objective: This study leverages the diversity of the All of Us Research Program (All of Us) dataset to develop a predictive model for cardiovascular disease (CVD) in breast cancer (BC) survivors. Central to this effort is a robust data integration pipeline that synthesizes electronic health records (EHRs), patient surveys, and genomic data while upholding fairness across demographic variables.

Materials and methods: We developed a universal data-wrangling pipeline to process and merge the heterogeneous data sources of the All of Us dataset, address missingness and variance in the data, and align disparate data modalities into a coherent framework for analysis. Using a composite feature set including EHR, lifestyle, and social determinants of health (SDoH) data, we then employed Adaptive Lasso and Random Forest regression models to predict 6 CVD outcomes. The models were evaluated using the concordance index (c-index) and the time-dependent area under the receiver operating characteristic curve (AUC) over a 10-year period.
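As an illustration of the c-index mentioned above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is not the authors' evaluation code; the function name `concordance_index` and the toy inputs are assumptions made for this example, and tied event times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's c-index: share of comparable pairs ranked correctly.

    times       -- observed follow-up time per subject
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed earlier: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four subjects, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.4, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of event times by predicted risk.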

Results: The Adaptive Lasso model showed consistent performance across most CVD outcomes, while the Random Forest model performed particularly well in predicting outcomes such as transient ischemic attack when incorporating the full multi-modal feature set. Feature importance analysis revealed age and previous coronary events as dominant predictors across CVD outcomes, with SDoH clustering labels highlighting the nuanced impact of social factors.

Discussion: The development of both a Cox-based predictive model and a Random Forest regression model illustrates how All of Us can be used to integrate EHR and patient survey data to enhance precision medicine. The inclusion of SDoH clustering labels revealed the significant impact of sociobehavioral factors on patient outcomes, emphasizing the importance of comprehensive health determinants in predictive models. Despite these advancements, limitations include the exclusion of genetic data, broad categorization of CVD conditions, and the need for fairness analyses to ensure equitable model performance across diverse populations. Future work should refine clinical and social variable measurements, incorporate advanced imputation techniques, and explore additional predictive algorithms to enhance model precision and fairness.
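The fairness analysis named above as a limitation could start from something as simple as comparing a discrimination metric across demographic subgroups. The sketch below is one hypothetical way to do that and is not taken from the study: it assumes a binary outcome at a fixed horizon (for example, a CVD event within 10 years), ignores censoring, and uses made-up helper names (`binary_auc`, `auc_by_group`, `max_auc_gap`) and toy data.

```python
from collections import defaultdict


def binary_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_group(groups, labels, scores):
    """Compute the AUC separately within each demographic group."""
    buckets = defaultdict(lambda: ([], []))
    for g, y, s in zip(groups, labels, scores):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: binary_auc(ys, ss) for g, (ys, ss) in buckets.items()}


def max_auc_gap(per_group):
    """Largest difference in AUC between any two groups with a defined AUC."""
    vals = [v for v in per_group.values() if v is not None]
    return max(vals) - min(vals) if vals else None


# Toy data (hypothetical): two groups, binary 10-year outcome, model scores.
toy_groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.9, 0.2]
per_group = auc_by_group(toy_groups, toy_labels, toy_scores)
gap = max_auc_gap(per_group)
```

A large gap would flag subgroups where the model discriminates less well; a censoring-aware metric such as a per-group c-index would be the more faithful choice for survival outcomes.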

Conclusion: This study demonstrates the viability of the diverse All of Us dataset for developing a multi-modality predictive model for CVD risk stratification in BC survivors. The data integration pipeline and subsequent predictive models establish a methodological foundation for future research into personalized healthcare.
