A comparative study of model-centric and data-centric approaches in the development of cardiovascular disease risk prediction models in the UK Biobank.

European heart journal. Digital health (IF 4.4, Q1, Cardiac & Cardiovascular Systems). Pub Date: 2023-05-15. eCollection Date: 2023-08-01. DOI: 10.1093/ehjdh/ztad033

Mohammad Mamouei, Thomas Fisher, Shishir Rao, Yikuan Li, Gholamreza Salimi-Khorshidi, Kazem Rahimi

European heart journal. Digital health, 4(4), pages 337-346. Publication type: Journal Article. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/0e/a6/ztad033.PMC10393888.pdf

Abstract

Aims: A diverse set of factors influence cardiovascular diseases (CVDs), but a systematic investigation of the interplay between these determinants and the contribution of each to CVD incidence prediction is largely missing from the literature. In this study, we leverage one of the most comprehensive biobanks worldwide, the UK Biobank, to investigate the contribution of different risk factor categories to more accurate incidence predictions in the overall population, by sex, different age groups, and ethnicity.

Methods and results: The investigated categories include the history of medical events, behavioural factors, socioeconomic factors, environmental factors, and measurements. We included data from a cohort of 405 257 participants aged 37-73 years and trained various machine learning and deep learning models on different subsets of risk factors to predict CVD incidence. Each of the models was trained on the complete set of predictors and subsets where each category was excluded. The results were benchmarked against QRISK3. The findings highlight that (i) leveraging a more comprehensive medical history substantially improves model performance. Relative to QRISK3, the best performing models improved the discrimination by 3.78% and improved precision by 1.80%. (ii) Both model- and data-centric approaches are necessary to improve predictive performance. The benefits of using a comprehensive history of diseases were far more pronounced when a neural sequence model, BEHRT, was used. This highlights the importance of the temporality of medical events that existing clinical risk models fail to capture. (iii) Besides the history of diseases, socioeconomic factors and measurements had small but significant independent contributions to the predictive performance.

Conclusion: These findings emphasize the need for considering broad determinants and novel modelling approaches to enhance CVD incidence prediction.
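The ablation design described under Methods and results (train on the complete set of predictors, then on subsets with one category excluded at a time) can be sketched as follows. This is only a minimal illustration, not the authors' code: the five category names come from the abstract, while every column name and the helper `ablation_feature_sets` are hypothetical placeholders.

```python
from typing import Dict, List

# Category names follow the abstract; the column names are hypothetical
# placeholders, not actual UK Biobank field names.
CATEGORIES: Dict[str, List[str]] = {
    "medical_history": ["dx_hypertension", "dx_diabetes"],
    "behavioural": ["smoking_status", "physical_activity"],
    "socioeconomic": ["deprivation_index"],
    "environmental": ["air_pollution_pm25"],
    "measurements": ["systolic_bp", "bmi"],
}


def ablation_feature_sets(categories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the full predictor set plus one subset per excluded category."""
    # Baseline: every predictor from every category.
    sets = {"all": [col for cols in categories.values() for col in cols]}
    # One ablated set per category: keep all columns except that category's.
    for excluded in categories:
        sets[f"without_{excluded}"] = [
            col
            for name, cols in categories.items()
            if name != excluded
            for col in cols
        ]
    return sets


feature_sets = ablation_feature_sets(CATEGORIES)
for name, cols in feature_sets.items():
    print(name, len(cols))
```

Each resulting feature set would then be passed to the same model and evaluation routine, so that the drop in performance relative to the "all" set indicates the contribution of the excluded category.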

Source journal

European heart journal. Digital health

CiteScore

5.00

Self-citation rate

0.00%

Articles published